Insurance fraud poses a concern across various sectors, including healthcare, homeownership, and automobile coverage. Its impact extends beyond the financial burden on insurers, affecting even non-fraudulent policyholders.
This analysis zeroes in on fraud within the auto insurance industry in India. The dataset utilized for this project was sourced from Kaggle (https://www.kaggle.com/).
Our objective is to leverage classification models to predict fraudulent auto insurance claims. Various classification models will be evaluated based on their efficacy in accurately predicting instances of actual fraud.
Readers may not be interested in every section of this analysis; specific sections can be reached directly through the table of contents on the left. For instance, clicking “Models” jumps straight to the classification models section.
The following programs were used for this project.
Python 3.10.10
R 4.2.2 (Specific Visualizations)
import pandas as pd
import pickle
from pandas.api.types import is_numeric_dtype
from pandas.api.types import is_categorical_dtype
import datetime
from datetime import date
from dateutil.relativedelta import relativedelta
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import matplotlib.patches as mpatches
from plotnine import *
import plotnine
import scipy
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder
from sklearn import set_config
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay, accuracy_score, roc_auc_score, recall_score, RocCurveDisplay, precision_score, f1_score, make_scorer
from sklearn import metrics
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
import time
import contextlib
import sys
from io import StringIO
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from mlxtend.frequent_patterns import apriori
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier
The data was downloaded as five individual data sets. We will review each data set for suitability of being merged into one data set.
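The per-data-set summaries below were produced with `DataFrame.info()`. A small helper along these lines (the helper name `frame_info` is our own sketch, not from the original code) prints a starred banner followed by the frame's structure:

```python
import pandas as pd

def frame_info(name, df):
    """Print a starred banner followed by the frame's structure."""
    print(f"{'*' * 12}{name} Information{'*' * 12}")
    df.info()

# Usage sketch on a toy frame
toy = pd.DataFrame({'CustomerID': ['Cust1', 'Cust2'], 'ReportedFraud': ['N', 'Y']})
frame_info('Toy', toy)
```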
## ************Train_Claim_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 DateOfIncident 28836 non-null object
## 2 TypeOfIncident 28836 non-null object
## 3 TypeOfCollission 28836 non-null object
## 4 SeverityOfIncident 28836 non-null object
## 5 AuthoritiesContacted 28836 non-null object
## 6 IncidentState 28836 non-null object
## 7 IncidentCity 28836 non-null object
## 8 IncidentAddress 28836 non-null object
## 9 IncidentTime 28836 non-null int32
## 10 NumberOfVehicles 28836 non-null int32
## 11 PropertyDamage 28836 non-null object
## 12 BodilyInjuries 28836 non-null int32
## 13 Witnesses 28836 non-null object
## 14 PoliceReport 28836 non-null object
## 15 AmountOfInjuryClaim 28836 non-null int32
## 16 AmountOfPropertyClaim 28836 non-null int32
## 17 AmountOfVehicleDamage 28836 non-null int32
## 18 AmountOfTotalClaim 28836 non-null int32
## dtypes: int32(7), object(12)
## memory usage: 3.4+ MB
## ************Train_Policy_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 InsurancePolicyNumber 28836 non-null int32
## 1 CustomerLoyaltyPeriod 28836 non-null int32
## 2 DateOfPolicyCoverage 28836 non-null object
## 3 InsurancePolicyState 28836 non-null object
## 4 Policy_CombinedSingleLimit 28836 non-null object
## 5 Policy_Deductible 28836 non-null int32
## 6 PolicyAnnualPremium 28836 non-null float64
## 7 UmbrellaLimit 28836 non-null int32
## 8 InsuredRelationship 28836 non-null object
## 9 CustomerID 28836 non-null object
## dtypes: float64(1), int32(4), object(5)
## memory usage: 1.8+ MB
## ************Train_Demographics_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 InsuredAge 28836 non-null int32
## 2 InsuredZipCode 28836 non-null int32
## 3 InsuredGender 28836 non-null object
## 4 InsuredEducationLevel 28836 non-null object
## 5 InsuredOccupation 28836 non-null object
## 6 InsuredHobbies 28836 non-null object
## 7 CapitalGains 28836 non-null int32
## 8 CapitalLoss 28836 non-null int32
## 9 Country 28836 non-null object
## dtypes: int32(4), object(6)
## memory usage: 1.8+ MB
## **********Traindata_with_Target_p Information**********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 ReportedFraud 28836 non-null object
## dtypes: object(2)
## memory usage: 450.7+ KB
## ************Train_Vehicle_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 115344 entries, 0 to 115343
## Data columns (total 3 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 115344 non-null object
## 1 VehicleAttribute 115344 non-null object
## 2 VehicleAttributeDetails 115344 non-null object
## dtypes: object(3)
## memory usage: 2.6+ MB
## *************Train_Vehicle_p First 25 Rows*************
## CustomerID VehicleAttribute VehicleAttributeDetails
## 0 Cust20179 VehicleID Vehicle8898
## 1 Cust21384 VehicleModel Malibu
## 2 Cust33335 VehicleMake Toyota
## 3 Cust27118 VehicleModel Neon
## 4 Cust13038 VehicleID Vehicle30212
## 5 Cust1801 VehicleID Vehicle24096
## 6 Cust30237 VehicleModel RAM
## 7 Cust21334 VehicleYOM 1996
## 8 Cust26634 VehicleYOM 1999
## 9 Cust20624 VehicleMake Chevrolet
## 10 Cust14947 VehicleID Vehicle15216
## 11 Cust21432 VehicleYOM 2002
## 12 Cust22845 VehicleYOM 2000
## 13 Cust9006 VehicleMake Accura
## 14 Cust30659 VehicleYOM 2003
## 15 Cust18447 VehicleMake Honda
## 16 Cust19144 VehicleID Vehicle29018
## 17 Cust26846 VehicleID Vehicle21867
## 18 Cust4801 VehicleYOM 1998
## 19 Cust18081 VehicleYOM 2013
## 20 Cust17021 VehicleMake BMW
## 21 Cust30660 VehicleYOM 2002
## 22 Cust22099 VehicleID Vehicle30877
## 23 Cust33560 VehicleYOM 2011
## 24 Cust17371 VehicleYOM 2001
The data sets train claim, train policy, train demographics, and train with target are ready to be merged into one data set.
Viewing the first twenty-five rows of the Train Vehicle data, we can see that the VehicleAttribute column has repeating rows: each CustomerID is associated with four attribute rows (VehicleID, VehicleMake, VehicleModel, and VehicleYOM). At 115344 rows, this data set is four times as long as the others. It must be reshaped before it can be merged: each level of VehicleAttribute should become its own feature, with values taken from the corresponding VehicleAttributeDetails entry. We will accomplish this by pivoting Train Vehicle from long to wide format, producing a data set that is shorter and wider.
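The long-to-wide reshape can first be illustrated on a toy frame (toy values, not from the data):

```python
import pandas as pd

# Long format: one row per (customer, attribute) pair
long = pd.DataFrame({
    'CustomerID': ['Cust1', 'Cust1', 'Cust2', 'Cust2'],
    'VehicleAttribute': ['VehicleMake', 'VehicleYOM', 'VehicleMake', 'VehicleYOM'],
    'VehicleAttributeDetails': ['Toyota', '2008', 'Honda', '2011'],
})

# Wide format: one row per customer, one column per attribute level
wide = long.pivot(index='CustomerID', columns='VehicleAttribute',
                  values='VehicleAttributeDetails').reset_index()
print(wide)
```

Note that `pivot` raises a `ValueError` if any (index, column) pair is duplicated, which doubles as a check that each customer has exactly one row per attribute.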
train_vehicle_wide = Train_Vehicle_p.pivot(index='CustomerID', columns='VehicleAttribute', values='VehicleAttributeDetails').reset_index()
## ************train_vehicle_wide Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 VehicleID 28836 non-null object
## 2 VehicleMake 28836 non-null object
## 3 VehicleModel 28836 non-null object
## 4 VehicleYOM 28836 non-null object
## dtypes: object(5)
## memory usage: 1.1+ MB
## *************train_vehicle_wide first 50 rows*************
## VehicleAttribute CustomerID VehicleID VehicleMake VehicleModel VehicleYOM
## 0 Cust10000 Vehicle26917 Audi A5 2008
## 1 Cust10001 Vehicle15893 Audi A5 2006
## 2 Cust10002 Vehicle5152 Volkswagen Jetta 1999
## 3 Cust10003 Vehicle37363 Volkswagen Jetta 2003
## 4 Cust10004 Vehicle28633 Toyota CRV 2010
## 5 Cust10005 Vehicle26409 Toyota CRV 2011
## 6 Cust10006 Vehicle12114 Mercedes C300 2000
## 7 Cust10007 Vehicle26987 Suburu C300 2010
## 8 Cust10009 Vehicle12490 Volkswagen Passat 1995
## 9 Cust1001 Vehicle28516 Saab 92x 2004
## 10 Cust10011 Vehicle8940 Nissan Ultima 2002
## 11 Cust10012 Vehicle9379 Ford Fusion 2004
## 12 Cust10013 Vehicle22024 Accura Fusion 2001
## 13 Cust10014 Vehicle3601 Suburu Impreza 2011
## 14 Cust10016 Vehicle7515 Saab 92x 2005
## 15 Cust10017 Vehicle31838 Saab 92x 2005
## 16 Cust10018 Vehicle35954 Toyota 93 2000
## 17 Cust10019 Vehicle19647 Saab 93 2000
## 18 Cust10021 Vehicle37694 Volkswagen Passat 2006
## 19 Cust10022 Vehicle31889 Toyota Highlander 1997
## 20 Cust10023 Vehicle10464 Toyota Highlander 1999
## 21 Cust10024 Vehicle24452 Dodge X5 2001
## 22 Cust10025 Vehicle12734 Dodge X5 2002
## 23 Cust10026 Vehicle14492 Volkswagen Passat 2001
## 24 Cust10027 Vehicle38970 Saab Passat 1995
## 25 Cust10028 Vehicle3996 Honda Accord 2015
## 26 Cust10029 Vehicle12477 Toyota Corolla 2015
## 27 Cust10030 Vehicle34293 Ford Forrestor 2006
## 28 Cust10031 Vehicle33775 Suburu F150 2005
## 29 Cust10032 Vehicle34708 Nissan Pathfinder 2012
## 30 Cust10034 Vehicle26030 Saab 92x 2006
## 31 Cust10035 Vehicle3961 Saab Jetta 2007
## 32 Cust10037 Vehicle38667 Dodge Neon 2012
## 33 Cust1004 Vehicle17051 Chevrolet Tahoe 2014
## 34 Cust10040 Vehicle7284 Audi Wrangler 2007
## 35 Cust10041 Vehicle2119 Jeep A3 2008
## 36 Cust10042 Vehicle7459 Accura A5 1997
## 37 Cust10043 Vehicle6244 Accura RSX 2010
## 38 Cust10044 Vehicle38446 Chevrolet Malibu 1998
## 39 Cust10046 Vehicle3199 Audi A5 2011
## 40 Cust10047 Vehicle13780 Audi A5 2009
## 41 Cust10049 Vehicle35318 Ford F150 2008
## 42 Cust1005 Vehicle26158 Accura RSX 2009
## 43 Cust10051 Vehicle33864 Dodge E400 2014
## 44 Cust10052 Vehicle16314 Honda Legacy 2002
## 45 Cust10053 Vehicle35570 Suburu Legacy 2000
## 46 Cust10054 Vehicle13054 Audi Ultima 2006
## 47 Cust10057 Vehicle23410 Suburu Legacy 2005
## 48 Cust10058 Vehicle24044 BMW 92x 2005
## 49 Cust10059 Vehicle25575 BMW X5 2006
We have taken the data from train vehicle and created a new data set called train vehicle wide. This new data set has four new columns and 28836 rows, which now matches the other four data sets. We are now ready to merge all of the data sets.
fraud=Train_Claim_p.merge(Train_Demographics_p, on="CustomerID")\
.merge(Train_Policy_p, on="CustomerID")\
.merge(train_vehicle_wide, on="CustomerID")\
.merge(Traindata_with_Target_p, on="CustomerID")
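Since each data set should contribute exactly one row per customer, merges like these can be guarded with pandas' `validate` argument, which raises `pandas.errors.MergeError` if the key is not unique on both sides. A defensive sketch on toy frames (not part of the original pipeline):

```python
import pandas as pd

claims = pd.DataFrame({'CustomerID': ['Cust1', 'Cust2'], 'AmountOfTotalClaim': [5000, 7200]})
policy = pd.DataFrame({'CustomerID': ['Cust1', 'Cust2'], 'Policy_Deductible': [500, 1000]})

# validate='one_to_one' raises MergeError if CustomerID is duplicated on either side
merged = claims.merge(policy, on='CustomerID', validate='one_to_one')
print(merged.shape)
```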
We’ll now test to ensure our data joins and transformations have
returned a dataframe.
# Function to check that an object is a DataFrame
def check_is_dataframe(df):
    assert isinstance(df, pd.DataFrame), "Error: object is not Data Frame."
    print("Object is Data Frame")

check_is_dataframe(fraud)
## Object is Data Frame
## *******************fraud Data Types*******************
## CustomerID object
## DateOfIncident object
## TypeOfIncident object
## TypeOfCollission object
## SeverityOfIncident object
## AuthoritiesContacted object
## IncidentState object
## IncidentCity object
## IncidentAddress object
## IncidentTime int32
## NumberOfVehicles int32
## PropertyDamage object
## BodilyInjuries int32
## Witnesses object
## PoliceReport object
## AmountOfInjuryClaim int32
## AmountOfPropertyClaim int32
## AmountOfVehicleDamage int32
## AmountOfTotalClaim int32
## InsuredAge int32
## InsuredZipCode int32
## InsuredGender object
## InsuredEducationLevel object
## InsuredOccupation object
## InsuredHobbies object
## CapitalGains int32
## CapitalLoss int32
## Country object
## InsurancePolicyNumber int32
## CustomerLoyaltyPeriod int32
## DateOfPolicyCoverage object
## InsurancePolicyState object
## Policy_CombinedSingleLimit object
## Policy_Deductible int32
## PolicyAnnualPremium float64
## UmbrellaLimit int32
## InsuredRelationship object
## VehicleID object
## VehicleMake object
## VehicleModel object
## VehicleYOM object
## ReportedFraud object
## dtype: object
Feature engineering encompasses several essential steps.
Firstly, there is feature creation, where new variables are generated from existing features to enhance both our model and data visualization.
Secondly, feature transformation involves converting features from one representation to another. For instance, we might transform a numerical feature into a categorical type.
Cleaning is a crucial process that entails scrutinizing the features. If something appears amiss with a feature, we can address the issue by either eliminating the problematic values or, in some cases, entirely removing the feature. Null values, for instance, can be handled by replacing them with alternative values, removing data points with null values, or, as previously mentioned, excluding the entire feature.
There are features that are dates though they do not have the correct
data type. We will create a function to transform these features to a
datetime data type.
def convert_to_datetime(df, column_name):
    df[column_name] = pd.to_datetime(df[column_name])

convert_to_datetime(fraud_v2, 'DateOfIncident')
convert_to_datetime(fraud_v2, 'DateOfPolicyCoverage')
We’ll now write a function to confirm that the features have been successfully transformed to a datetime data type.
def check_is_datetime(df, column_name):
    assert pd.api.types.is_datetime64_any_dtype(df[column_name]), f"Error: feature '{column_name}' is not datetime dtype."
    print(f"Feature '{column_name}' is datetime dtype")

check_is_datetime(fraud_v2, 'DateOfIncident')
## Feature 'DateOfIncident' is datetime dtype
check_is_datetime(fraud_v2,'DateOfPolicyCoverage')
## Feature 'DateOfPolicyCoverage' is datetime dtype
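By default `pd.to_datetime` raises on unparseable strings. If malformed dates were a concern, `errors='coerce'` converts them to `NaT` instead; a variant sketch, not used in the pipeline above:

```python
import pandas as pd

s = pd.Series(['2015-01-24', 'not-a-date'])
parsed = pd.to_datetime(s, errors='coerce')  # unparseable values become NaT
print(parsed.isna().sum())
```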
Now that the features have been transformed to the correct data type, we will use them to create new features.
fraud_v2["coverageIncidentDiff"]=(fraud_v2["DateOfIncident"]-fraud_v2["DateOfPolicyCoverage"])
fraud_v2["coverageIncidentDiff"]=fraud_v2["coverageIncidentDiff"]/np.timedelta64(1,'Y')
## ************CoverageIncidentDiff************
## count 28836.000000
## mean 13.074582
## std 6.560420
## min -0.054758
## 25% 7.646290
## 50% 13.172071
## 75% 18.617768
## max 25.123035
## Name: coverageIncidentDiff, dtype: float64
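Dividing by `np.timedelta64(1, 'Y')` is deprecated or unsupported in recent NumPy/pandas versions. A version-stable alternative, sketched on toy dates, is to divide the day count by the mean Gregorian year length (approximately 365.2425 days):

```python
import pandas as pd

d1 = pd.Series(pd.to_datetime(['2015-06-01']))
d0 = pd.Series(pd.to_datetime(['2005-06-01']))

# Day count divided by the mean Gregorian year length (~365.2425 days)
years = (d1 - d0).dt.days / 365.2425
print(years.round(2))
```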
fraud_v2['dayOfWeek'] = fraud_v2["DateOfIncident"].dt.day_name()
## *****dayOfWeek Value Counts*****
## Friday 0.15
## Tuesday 0.15
## Thursday 0.14
## Saturday 0.14
## Wednesday 0.14
## Monday 0.14
## Sunday 0.14
## Name: dayOfWeek, dtype: float64
Certain features are numeric yet may serve our models better as categorical. This can be assessed by checking the unique values of these features.
## ******** Unique Number of Vehicles********
## [3 1 4 2]
## ******** Unique Bodily Injuries********
## [1 2 0]
The above outputs indicate that both NumberOfVehicles and BodilyInjuries would be best as type categorical. We will create a function that converts numerical data types to categorical, then apply it to the selected numerical features.
def convert_to_cat(df, column_name):
    df[column_name] = df[column_name].astype('category')

convert_to_cat(fraud_v2, 'NumberOfVehicles')
convert_to_cat(fraud_v2, 'BodilyInjuries')
We use a function to confirm the two features have been transformed to a categorical data type.
def check_is_categorical(df, column_name):
    assert pd.api.types.is_categorical_dtype(df[column_name]), f"Error: feature '{column_name}' is not categorical dtype."
    print(f"Feature '{column_name}' is categorical dtype")

check_is_categorical(fraud_v2, 'BodilyInjuries')
## Feature 'BodilyInjuries' is categorical dtype
check_is_categorical(fraud_v2,'NumberOfVehicles')
## Feature 'NumberOfVehicles' is categorical dtype
Both features are now of type category.
## *************Incident Time Unique Values*************
## [17 10 22 7 20 18 3 5 14 16 15 13 12 9 19 4 11 1 8 0 6 21 23 2
## -5]
IncidentTime has unique values that would warrant it becoming categorical, though the many levels would not be optimal for use in our modeling. We can remedy this by placing unique time values into bins using a Python dictionary. This will reduce the number of levels.
time_day = {
    5: 'early morning', 6: 'early morning', 7: 'early morning', 8: 'early morning',
    9: 'late morning', 10: 'late morning', 11: 'late morning',
    12: 'early afternoon', 13: 'early afternoon', 14: 'early afternoon', 15: 'early afternoon',
    16: 'late afternoon', 17: 'late afternoon',
    18: 'evening', 19: 'evening',
    20: 'night', 21: 'night', 22: 'night', 23: 'night', 24: 'night',
    1: 'night', 2: 'night', 3: 'night', 4: 'night'
}
fraud_v2['IncidentPeriodDay']=fraud_v2['IncidentTime'].map(time_day)
## ***Incident Period Day Value Counts***
## night 7458
## early afternoon 5785
## early morning 5580
## late morning 3661
## late afternoon 3231
## evening 2699
## Name: IncidentPeriodDay, dtype: int64
We find from the value count output for the new feature IncidentPeriodDay that incident times have been placed into six unique periods of the day. Note that the dictionary has no entry for hour 0 or for the erroneous value -5 seen in the unique values above, so rows with those times map to null; they surface as missing values in the null check later in this section.
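The dictionary map can equivalently be expressed with `pd.cut` and explicit bin edges (an alternative sketch with the same period boundaries; `ordered=False` permits the repeated 'night' label):

```python
import pandas as pd

hours = pd.Series([3, 7, 10, 13, 17, 19, 22])
bins = [0, 4, 8, 11, 15, 17, 19, 23]  # right-closed intervals: (0,4], (4,8], ...
labels = ['night', 'early morning', 'late morning', 'early afternoon',
          'late afternoon', 'evening', 'night']
period = pd.cut(hours, bins=bins, labels=labels, ordered=False)
print(period.tolist())
```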
fraud_v3=fraud_v2.copy()
Date features used in creating new features are no longer required and will be removed from the data set
fraud_v3=fraud_v3.drop(['DateOfIncident', 'DateOfPolicyCoverage', 'IncidentTime'], axis=1)
print("fraud_v2 data frame includes datatypes object is", pd.api.types.is_object_dtype(fraud_v2.columns))
## fraud_v2 data frame includes datatypes object is True
For purposes of classification algorithms and visualizations we’ll need to convert all categorical columns (Object Data Type) to the category data type. This will be accomplished by creating a function to identify non-numerical columns and converting them to the category data type.
def convert_cats(df):
    # Identify object-dtype columns and convert each to category
    for col in df.columns:
        if pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype('category')

convert_cats(fraud_v3)
We’ll write a function to review the dataset and ensure there are no columns of type object.
def check_no_object_dtype(df):
    assert not any(pd.api.types.is_object_dtype(df[col]) for col in df.columns), "Error: DataFrame contains object dtype columns."
    print("✅ No object dtype columns found in the DataFrame.")

check_no_object_dtype(fraud_v3)
## ✅ No object dtype columns found in the DataFrame.
Success. All columns of type object have been transformed to type category.
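For reference, the same conversion can be written without an explicit loop using `select_dtypes` (an equivalent idiom shown on a toy frame, not the function used above):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2], 'c': ['p', 'q']})

# Select the object-dtype columns and convert them all at once
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].astype('category')
print(df.dtypes.tolist())
```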
print(f"Shape of fraud_v2: {fraud_v2.shape}")
## Shape of fraud_v2: (28836, 45)
print(f"Shape of fraud_v3: {fraud_v3.shape}")
## Shape of fraud_v3: (28836, 42)
From the shape comparison we observe that fraud_v3 has three fewer columns than fraud_v2, confirming that the three date-related features were dropped.
fraud_v3["ReportedFraud"].value_counts(normalize=True).round(2)
## N 0.73
## Y 0.27
## Name: ReportedFraud, dtype: float64
gs=plt.GridSpec(1, 3)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Categorical Counts-1', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
#plt.title('Type of Incident',fontsize=7, y=1)
hg=sns.countplot(data = fraud_v3, x = 'TypeOfIncident', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Type of Incident", fontsize=5)
hg.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
sp=sns.countplot(data=fraud_v3, x='TypeOfCollission', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=5)
sp.set_xlabel("Type of Collision", fontsize=5)
sp.set_ylabel("Count",fontsize=4)
#plt.title('Reported Fraud',fontsize=7, y=1)
bp=sns.countplot(data=fraud_v3, x='ReportedFraud', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Reportered Fraud", fontsize=5)
bp.set_ylabel("Count", fontsize=5)
plt.tight_layout()
plt.show()
plt.clf()
Next, we’ll check the data for missing values (NA or null) as well as values marked unknown, which may be denoted by placeholder terms or symbols.
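Beyond true nulls, this data marks unknowns with sentinel tokens such as '?', 'NA', '???', and 'MISSINGVALUE'. A helper along these lines (our own sketch, not from the original code) counts such tokens per non-numeric column:

```python
import pandas as pd

SENTINELS = {'?', 'NA', '???', 'MISSINGVALUE'}

def sentinel_counts(df):
    """Count sentinel 'unknown' tokens in each non-numeric column."""
    out = {}
    for col in df.select_dtypes(exclude='number').columns:
        n = df[col].astype(str).isin(SENTINELS).sum()
        if n:
            out[col] = int(n)
    return out

# Toy frame illustrating the scan
toy = pd.DataFrame({'PoliceReport': ['Yes', '?', 'No'],
                    'Witnesses': ['2', 'MISSINGVALUE', '1'],
                    'Amount': [1, 2, 3]})
print(sentinel_counts(toy))
```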
print(fraud_v3.isnull().sum())
## CustomerID 0
## TypeOfIncident 0
## TypeOfCollission 0
## SeverityOfIncident 0
## AuthoritiesContacted 0
## IncidentState 0
## IncidentCity 0
## IncidentAddress 0
## NumberOfVehicles 0
## PropertyDamage 0
## BodilyInjuries 0
## Witnesses 0
## PoliceReport 0
## AmountOfInjuryClaim 0
## AmountOfPropertyClaim 0
## AmountOfVehicleDamage 0
## AmountOfTotalClaim 0
## InsuredAge 0
## InsuredZipCode 0
## InsuredGender 0
## InsuredEducationLevel 0
## InsuredOccupation 0
## InsuredHobbies 0
## CapitalGains 0
## CapitalLoss 0
## Country 0
## InsurancePolicyNumber 0
## CustomerLoyaltyPeriod 0
## InsurancePolicyState 0
## Policy_CombinedSingleLimit 0
## Policy_Deductible 0
## PolicyAnnualPremium 0
## UmbrellaLimit 0
## InsuredRelationship 0
## VehicleID 0
## VehicleMake 0
## VehicleModel 0
## VehicleYOM 0
## ReportedFraud 0
## coverageIncidentDiff 0
## dayOfWeek 0
## IncidentPeriodDay 422
## dtype: int64
with contextlib.redirect_stderr(sys.stdout):
    my_tab = pd.crosstab(index=fraud_v3["TypeOfIncident"], columns=fraud_v3["TypeOfCollission"], normalize=True).round(2)
fig = plt.figure(figsize=(13, 10))
sns.heatmap(my_tab, cmap="BuGn",cbar=False, annot=True,linewidth=0.3)
plt.yticks(rotation=0)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0, 0.5, 'Multi-vehicle Collision'), Text(0, 1.5, 'Parked Car'), Text(0, 2.5, 'Single Vehicle Collision'), Text(0, 3.5, 'Vehicle Theft')])
plt.xticks(rotation=60)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0.5, 0, '?'), Text(1.5, 0, 'Front Collision'), Text(2.5, 0, 'Rear Collision'), Text(3.5, 0, 'Side Collision')])
plt.title('Type of Incident vs Type of Collision', fontsize=20)
plt.xlabel('TypeOfCollision', fontsize=15)
plt.ylabel('TypeOIncident', fontsize=15)
plt.show()
plt.clf()
We observe from the cross table that the unknown collision type (denoted by a question mark) is concentrated in a small number of incident types. These data points will be retained by recoding the '?' category as 'None'.
fraud_v4['TypeOfCollission'] = fraud_v4['TypeOfCollission'].replace(['?'], 'None')
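Because the column is categorical, the same recoding can be done directly on the categories, which relabels without touching every row (an equivalent idiom shown on a toy Series):

```python
import pandas as pd

s = pd.Series(['?', 'Front Collision', 'Rear Collision'], dtype='category')

# Relabel the '?' category in place of per-row replacement
s = s.cat.rename_categories({'?': 'None'})
print(s.cat.categories.tolist())
```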
plt.figure(figsize=(16,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v4, x='TypeOfCollission')
#plt.tick_params(label_rotation=45)
ax.tick_params(axis='both', which='major', labelsize=11)
ax.set_title("Type of Collision-Changed", size=22)
ax.set(xlabel=None)
ax.set(ylabel=None)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
fig = plt.figure(figsize=(6, 6))
fig.tight_layout(pad=1.30,h_pad=4, w_pad=3)
fig.suptitle('Categorical Review-Two', fontsize=11)
sns.set_style("dark")
plt.subplot(331)
plt.title('Witnesses', fontsize=8, y=0.90)
dt_1=sns.countplot(data = count_plts_2, x = 'Witnesses')
dt_1.tick_params(axis='both', which='major', labelsize=4)
dt_1.tick_params(axis='x',labelrotation=35)
dt_1.set(xlabel=None)
dt_1.set(ylabel=None)
plt.subplot(332)
plt.title('Bodily Injuries',fontsize=8, y=0.90)
dt_2=sns.countplot(data = count_plts_2, x = 'BodilyInjuries')
dt_2.tick_params(axis='both', which='major', labelsize=6)
#dt_2.tick_params(axis='x',labelrotation=35)
dt_2.set(xlabel=None)
dt_2.set(ylabel=None)
plt.subplot(333)
plt.title('Property Damage',fontsize=8, y=0.90)
dt_3=sns.countplot(data = count_plts_2, x = 'PropertyDamage')
dt_3.tick_params(axis='both', which='major', labelsize=6)
#dt_3.tick_params(axis='x',labelrotation=35)
dt_3.set(xlabel=None)
dt_3.set(ylabel=None)
plt.subplot(334)
plt.title('Number Of Vehicles',fontsize=6, y=0.80)
dt_4=sns.countplot(data = count_plts_2, x = 'NumberOfVehicles')
dt_4.tick_params(axis='both', which='major', labelsize=6)
#dt_4.tick_params(axis='x',labelrotation=45)
dt_4.set(xlabel=None)
dt_4.set(ylabel=None)
plt.subplot(335)
plt.title('Incident State',fontsize=8, y=0.90)
dt_5=sns.countplot(data = count_plts_2, x = 'IncidentState')
dt_5.tick_params(axis='both', which='major', labelsize=6)
dt_5.tick_params(axis='x',labelrotation=90)
dt_5.set(xlabel=None)
dt_5.set(ylabel=None)
plt.subplot(336)
plt.title('Authorities Contacted',fontsize=8, y=0.90)
dt_6=sns.countplot(data = count_plts_2, x = 'AuthoritiesContacted')
dt_6.tick_params(axis='both', which='major', labelsize=6)
dt_6.tick_params(axis='x',labelrotation=90)
dt_6.set(xlabel=None)
dt_6.set(ylabel=None)
plt.subplot(337)
plt.title('SeverityOfIncident',fontsize=8, y=0.90)
dt_7=sns.countplot(data = count_plts_2, x ='SeverityOfIncident')
dt_7.tick_params(axis='both', which='major', labelsize=6)
dt_7.tick_params(axis='x',labelrotation=90)
dt_7.set(xlabel=None)
dt_7.set(ylabel=None)
plt.subplots_adjust(wspace=1.0, hspace=2.0)
plt.show()
plt.clf()
The figure Categorical Review-Two reveals several features with missing values that must be addressed. First, the PropertyDamage feature will be dropped because a large share of observations have no answer, denoted by a question mark.
fraud_v5=fraud_v5.drop(['PropertyDamage'], axis=1)
Next, the category MISSINGVALUE from the Witnesses feature will be dropped.
fraud_v5['Witnesses']=fraud_v5['Witnesses'].cat.remove_categories("MISSINGVALUE")
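Note that `cat.remove_categories` does not drop the affected rows; it recodes the removed category's values as NaN, as a toy Series shows:

```python
import pandas as pd

s = pd.Series(['1', 'MISSINGVALUE', '2'], dtype='category')

# Removing a category turns its values into NaN, leaving row count unchanged
s = s.cat.remove_categories('MISSINGVALUE')
print(s.isna().sum())
```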
plt.figure(figsize=(14,8))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v5, x='Witnesses')
#plt.tick_params(label_rotation=45)
ax.set_title("Witnesses-Changed", size=20)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both', which='major', labelsize=14)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
fig = plt.figure(figsize=(10, 6))
fig.tight_layout(pad=1.40,h_pad=4, w_pad=3)
fig.suptitle('Categorical Review-Three', fontsize=13)
sns.set_style("dark")
plt.subplot(231)
plt.title('Police Report', fontsize=8, y=0.90)
et_1=sns.countplot(data = count_plts_3, x = 'PoliceReport')
et_1.tick_params(axis='both', which='major', labelsize=6)
et_1.tick_params(axis='x',labelrotation=75)
et_1.set(xlabel=None)
et_1.set(ylabel=None)
plt.subplot(232)
plt.title('Insured Gender',fontsize=8, y=0.90)
et_2=sns.countplot(data = count_plts_3, x = 'InsuredGender')
et_2.tick_params(axis='both', which='major', labelsize=6)
et_2.tick_params(axis='x',labelrotation=75)
et_2.set(xlabel=None)
et_2.set(ylabel=None)
plt.subplot(233)
plt.title('Insurance Policy State',fontsize=8, y=0.90)
et_4=sns.countplot(data = count_plts_3, x = 'InsurancePolicyState')
et_4.tick_params(axis='both', which='major', labelsize=6)
et_4.tick_params(axis='x',labelrotation=70)
et_4.set(xlabel=None)
plt.subplot(234)
plt.title('Insured Education Level',fontsize=7, y=0.90)
et_3=sns.countplot(data = count_plts_3, x = 'InsuredEducationLevel')
et_3.tick_params(axis='both', which='major', labelsize=6)
et_3.tick_params(axis='x',labelrotation=90)
et_3.set(xlabel=None)
et_3.set(ylabel=None)
plt.subplot(235)
plt.title('Insured Relationship',fontsize=8, y=0.90)
et_5=sns.countplot(data = count_plts_3, x = 'InsuredRelationship')
et_5.tick_params(axis='both', which='major', labelsize=6)
et_5.tick_params(axis='x',labelrotation=90)
et_5.set(xlabel=None)
et_5.set(ylabel=None)
plt.subplot(236)
plt.title('Day of Week',fontsize=8, y=0.90)
et_6=sns.countplot(data = count_plts_3, x = 'dayOfWeek')
et_6.tick_params(axis='both', which='major', labelsize=6)
et_6.tick_params(axis='x',labelrotation=90)
et_6.set(xlabel=None)
et_6.set(ylabel=None)
plt.subplots_adjust(wspace=1.0, hspace=1.4)
plt.show()
plt.clf()
fraud_v5['Witnesses']=fraud_v5['Witnesses'].cat.remove_unused_categories()
Categorical Review-Three shows additional categorical features that must be either cleaned or dropped. First, the PoliceReport feature has close to 10,000 missing or unknown values (denoted by a question mark) and will be dropped.
fraud_v6=fraud_v6.drop(['PoliceReport'], axis=1)
The next feature requiring attention is InsuredGender, which has a small number of missing values denoted by NA. This category will be removed from InsuredGender; omitting such a small category will have no meaningful effect on our models.
fraud_v6['InsuredGender']=fraud_v6['InsuredGender'].cat.remove_categories("NA")
fraud_v6['InsuredGender']=fraud_v6['InsuredGender'].cat.remove_unused_categories()
plt.figure(figsize=(14,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=fraud_v6, x='InsuredGender')
#plt.tick_params(label_rotation=45)
ax.set_title("Insured Gender-Changed", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both',labelsize = 15)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
## *******premium_missing shape*******
## (141, 40)
## *******fraud_v6 shape*******
## (28836, 40)
plt.figure(figsize=(16,6))
ax=sns.countplot(data=fraud_v6, x='VehicleMake')
ax.set_title("Vehicle Make", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='x',labelrotation=60,labelsize =13)
ax.tick_params(axis='y', labelsize=13)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
VehicleMake has a small number of missing values (denoted by ‘???’). The category ‘???’ will be removed from the feature.
fraud_v7['VehicleMake']=fraud_v7['VehicleMake'].cat.remove_categories("???")
fraud_v7['VehicleMake']=fraud_v7['VehicleMake'].cat.remove_unused_categories()
veh_mk = vehicle_count.groupby('VehicleMake')['count'].agg('count').reset_index()
fig, axes = plt.subplots(figsize=(12, 8))
line_colors=['blue', 'cyan', 'green', 'red','skyblue','maroon', 'salmon', 'yellow',
'orange','lightgreen','darkviolet', 'fuchsia','darkmagenta','lime' ]
axes.hlines(veh_mk['VehicleMake'], xmin=0,
xmax=veh_mk['count'],colors=line_colors)
axes.plot(veh_mk['count'],veh_mk['VehicleMake'],"o")
axes.set_xlim(0)
## (0.0, 2535.75)
axes.tick_params(axis='both', which='major', labelsize=10)
plt.title('Make of Vehicle Count', fontsize=20)
plt.show()
plt.clf()
The VehicleMake feature now has no missing values.
Filtering for PolicyAnnualPremium values equal to -1 returns 141 rows. From the Attribute Information PDF provided with the data set, we know that -1 represents a missing value. All observations with -1 will be removed.
fraud_v7=fraud_v7[fraud_v7['PolicyAnnualPremium']!=-1]
print('**Policy Annual Premium Shape**')
## **Policy Annual Premium Shape**
fraud_v7[fraud_v7['PolicyAnnualPremium']==-1].shape
## (0, 40)
From the shape output we can observe all values of -1 have been removed.
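An alternative to dropping the 141 rows would be to recode the -1 sentinel as NaN and then impute. A sketch of the recode step on toy values (not the approach taken above):

```python
import numpy as np
import pandas as pd

premium = pd.Series([1200.5, -1.0, 980.0])

# Treat the -1 sentinel as a true missing value
premium = premium.replace(-1.0, np.nan)
print(premium.isna().sum())
```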
Certain visualizations require numeric-only data. We’ll create a data set that contains only numeric data types.
#select only the numeric columns in the DataFrame
numeric_data=fraud_v7.select_dtypes(include=np.number)
numeric_data=numeric_data.drop(['InsuredZipCode', 'InsurancePolicyNumber'], axis=1)
## ******************Numeric Data Types******************
## AmountOfInjuryClaim int32
## AmountOfPropertyClaim int32
## AmountOfVehicleDamage int32
## AmountOfTotalClaim int32
## InsuredAge int32
## CapitalGains int32
## CapitalLoss int32
## CustomerLoyaltyPeriod int32
## Policy_Deductible int32
## PolicyAnnualPremium float64
## UmbrellaLimit int32
## coverageIncidentDiff float64
## dtype: object
The data set numeric_data only includes features of numeric data types as seen from the above output.
plt.figure(figsize=(10, 7))
plt.tick_params(axis='both', which='major', labelsize=9)
plt.title('Correlation Heatmap', fontsize=12)
# define the mask to set the values in the upper triangle to True
mask=np.triu(np.ones_like(numeric_data.corr(), dtype=bool))
heatmap = sns.heatmap(numeric_data.corr(), mask=mask,vmin=-1, vmax=1, annot=True, cmap='BrBG', annot_kws={"size": 4})
plt.show()
plt.clf()
There is very high to high correlation between Amount of Injury Claim, Amount of Property Claim, Amount of Vehicle Damage, and Amount of Total Claim. This is unsurprising as Amount of Total Claim is the sum of the other three. Amount of Total Claim is the only feature of the four that will be used for our machine learning models.
Other features exhibiting very high correlation are Customer Loyalty Period and Insured Age. This makes sense, as older customers have had more time to accrue loyalty than younger customers. Still, we will retain both features for our models.
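The pairwise inspection above can also be done programmatically. Below is a minimal sketch on a toy frame (column names mirror the real data set; the values and the 0.6 threshold are made up for illustration) that lists feature pairs whose absolute correlation is high, with the total claim constructed as the sum of its parts, as in the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for numeric_data; AmountOfTotalClaim is the sum of
# the three component claims, mirroring the relationship described above.
rng = np.random.default_rng(0)
injury = rng.normal(10_000, 2_000, 500)
prop = rng.normal(5_000, 1_000, 500)
damage = rng.normal(20_000, 4_000, 500)
df = pd.DataFrame({
    "AmountOfInjuryClaim": injury,
    "AmountOfPropertyClaim": prop,
    "AmountOfVehicleDamage": damage,
    "AmountOfTotalClaim": injury + prop + damage,
})

# Keep only the upper triangle so each pair appears once, then list pairs
# whose absolute correlation exceeds the illustrative 0.6 threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.6].index.tolist()
```

On the toy data only the vehicle-damage/total-claim pair clears the threshold, since damage contributes most of the total's variance.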
Features not important for visualizing or building models will now be dropped.
fraud_v8=fraud_v8.drop(['CustomerID', 'IncidentAddress', 'InsuredZipCode', 'InsuredHobbies','Country', 'InsurancePolicyNumber', 'VehicleID'], axis=1)
## **fraud_v8 shape**
## (28695, 33)
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Amount of Total Claim', fontsize=11)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-Total Claim and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "AmountOfTotalClaim", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram-Amount of Total Claim', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="AmountOfTotalClaim")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-Amount of Total Claim and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="AmountOfTotalClaim", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)
ac_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Insured Age Review', fontsize=11)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-Insured Age and Reported Fraud', fontsize=7)
ia_1=sns.boxplot(data = fraud_v8, x = "InsuredAge", y='ReportedFraud')
ia_1.tick_params(axis='x', which='major', labelsize=5)
ia_1.tick_params(axis='y', labelsize=5)
ia_1.tick_params(axis='x', labelrotation=60)
ia_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram- Insured Age', fontsize=7)
ia_2=sns.histplot(data=fraud_v8, x="InsuredAge")
ia_2.tick_params(axis='x', which='major', labelsize=5)
ia_2.tick_params(labelrotation=60)
ia_2.tick_params(axis='y', labelsize=5)
ia_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-Insured Age and Reported Fraud', fontsize=7)
ia_3=sns.histplot(data=fraud_v8, x="InsuredAge",hue="ReportedFraud")
ia_3.tick_params(axis='x', which='major', labelsize=5)
ia_3.tick_params(axis='y', labelsize=5)
ia_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Policy Annual Premium', fontsize=11)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-AnnualPremium and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "PolicyAnnualPremium", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram-Amount of Annual Premium', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="PolicyAnnualPremium")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-Annual Premium and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="PolicyAnnualPremium", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)
ac_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Customer Loyalty Period', fontsize=11)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-Customer Loyalty Period and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "CustomerLoyaltyPeriod", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram-Customer Loyalty Period', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="CustomerLoyaltyPeriod")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-Customer Loyalty Period and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="CustomerLoyaltyPeriod", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)
ac_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Difference Coverage Start and Incident', fontsize=11)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-Coverage Start Incident Difference and Reported Fraud', fontsize=7)
ac_1=sns.boxplot(data = fraud_v8, x = "coverageIncidentDiff", y='ReportedFraud')
ac_1.tick_params(axis='x', which='major', labelsize=5)
ac_1.tick_params(axis='y', labelsize=5)
ac_1.tick_params(axis='x', labelrotation=60)
ac_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram-Coverage Start Incident Difference', fontsize=7)
ac_2=sns.histplot(data=fraud_v8, x="coverageIncidentDiff")
ac_2.tick_params(axis='x', which='major', labelsize=5)
ac_2.tick_params(labelrotation=60)
ac_2.tick_params(axis='y', labelsize=5)
ac_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-Coverage Start Incident Difference and Reported Fraud', fontsize=7)
ac_3=sns.histplot(data=fraud_v8, x="coverageIncidentDiff", hue="ReportedFraud")
ac_3.tick_params(axis='x', which='major', labelsize=5)
ac_3.tick_params(axis='y', labelsize=5)
ac_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
## ****Year Of Make****
## 2015 416
## 1995 531
## 1996 828
## 2014 871
## 1997 1131
## 2013 1256
## 1998 1276
## 2012 1308
## 2001 1428
## 1999 1479
## 2011 1518
## 2000 1523
## 2002 1527
## 2003 1571
## 2008 1622
## 2009 1623
## 2010 1631
## 2005 1635
## 2006 1637
## 2004 1661
## 2007 1709
## Name: VehicleYOM, dtype: int64
plt.figure(figsize=(16,6))
ax=sns.countplot(data=fraud_v8, x='VehicleYOM')
ax.set_title("Vehicle Year of Make", size=25)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='x',labelrotation=60,labelsize =13)
ax.tick_params(axis='y', labelsize=13)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
Auto insurance premiums are generally based on personal details such as choice of coverage, type of vehicle driven, and age of the car. The newer the car, typically the more expensive the insurance, because premiums reflect the vehicle’s replacement cost. The year the car was manufactured plays as big a part in the premium as the make and model itself. The above plot displays all vehicle model years in our data set. We find just over 6,000 autos that are 15 years old or older relative to the latest model year in the data, 2015.
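The "15 years or older" count above can be computed directly from the model-year column. A minimal sketch on a toy Series (the real fraud_v8['VehicleYOM'] spans 1995 through 2015; these six values are made up):

```python
import pandas as pd

# Toy Series standing in for fraud_v8['VehicleYOM'].
yom = pd.Series([1995, 1996, 2000, 2005, 2014, 2015])

latest = yom.max()          # newest model year in the data
age = latest - yom          # vehicle age relative to that year
n_old = (age >= 15).sum()   # vehicles 15 or more years old
```

Applied to the real column, the same expression yields the count quoted above.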
fig = plt.figure(figsize=(11, 6))
fig.suptitle('Umbrella Limit Review', fontsize=12)
sns.set_style("dark")
plt.subplot(131)
plt.title('Box Plot-Umbrella Limit and Reported Fraud', fontsize=7)
ul_1=sns.boxplot(data = fraud_v8, x = "UmbrellaLimit", y='ReportedFraud')
ul_1.tick_params(axis='x', which='major', labelsize=5)
ul_1.tick_params(axis='y', labelsize=5)
ul_1.tick_params(axis='x', labelrotation=60)
ul_1.set(xlabel=None)
plt.subplot(132)
plt.title('Histogram-UmbrellaLimit', fontsize=7)
ul_2=sns.histplot(data=fraud_v8, x="UmbrellaLimit",bins=20)
ul_2.tick_params(axis='x', which='major', labelsize=5)
ul_2.tick_params(axis='x',labelrotation=60)
ul_2.tick_params(axis='y', labelsize=5)
ul_2.set(xlabel=None)
plt.subplot(133)
plt.title('Histogram-UmbrellaLimit and Reported Fraud', fontsize=7)
ul_3=sns.histplot(data=fraud_v8, x="UmbrellaLimit",hue='ReportedFraud',bins=20)
ul_3.tick_params(axis='x', which='major', labelsize=5)
ul_3.tick_params(axis='y', labelsize=5)
ul_3.set(xlabel=None)
plt.subplots_adjust(wspace=0.45)
plt.show()
plt.clf()
The above plots are unusual. Both box plots have a median of zero. Reported Fraud=“Y” has a mean of 1,000,000 while Reported Fraud=“N” has a mean of 918,000. Both histograms peak at zero with a long tail to the right.
There are only 7,506 data points greater than zero: 2,417 for Yes and 5,089 for No. Data points greater than zero represent only 26 percent of the entire data set. Normally this would seem unusual, and we would review the raw data for errors. Checking the description of umbrella limit, however, we find that such extreme data points are not uncommon. Umbrella insurance provides “excess liability insurance” beyond the liability coverage already included in an auto policy. It’s for expensive situations where medical bills and/or repairs exceed the limits of “base” auto policies. Policyholders in higher income brackets are the usual purchasers of umbrella coverage. Thus, across all data points, the mean of 972,000 and max of 10,000,000 are plausible values. Additionally, the median of zero is unsurprising, as few insured add umbrella coverage to their policies.
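The nonzero share and per-label counts quoted above come from straightforward filtering and grouping. A minimal sketch on a toy frame (the real data has 28,695 rows with 7,506 nonzero UmbrellaLimit values; these eight rows are made up):

```python
import pandas as pd

# Toy frame standing in for fraud_v8; values are illustrative only.
df = pd.DataFrame({
    "UmbrellaLimit": [0, 0, 0, 2_000_000, 0, 5_000_000, 0, 0],
    "ReportedFraud": ["N", "Y", "N", "Y", "N", "N", "Y", "N"],
})

nonzero = df[df["UmbrellaLimit"] > 0]
share = len(nonzero) / len(df)                      # fraction of rows with a limit
by_fraud = nonzero.groupby("ReportedFraud").size()  # nonzero counts per fraud label
```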
sns.set(style="darkgrid")
plt.figure(figsize=(10, 7))
# top bar -> sum all values (ReportedFraud=No and ReportedFraud=Yes) to find y position of the bars
total = fraud_v8.groupby('SeverityOfIncident')['count'].sum().reset_index()
# bar chart 1 -> top bars (group of 'ReportedFraud=No')
bar1 = sns.barplot(x="SeverityOfIncident", y="count", data=total, color='darkblue')
# bottom bar -> take only ReportedFraud=Yes values from the data
fraud = fraud_v8[fraud_v8.ReportedFraud=='Y']
# bar chart 2 -> bottom bars (group of 'ReportedFraud=Yes')
bar2 = sns.barplot(x="SeverityOfIncident", y="count", data=fraud, estimator=sum, errorbar=None, color='lightblue')
bar2 = sns.barplot(x="SeverityOfIncident", y="count", data=fraud, estimator=sum, errorbar=None, color='lightblue')
# add legend
top_bar = mpatches.Patch(color='darkblue', label='Fraud = No')
bottom_bar = mpatches.Patch(color='lightblue', label='Fraud = Yes')
plt.legend(handles=[top_bar, bottom_bar],fontsize=9, loc="upper right")
plt.tick_params(axis='x', which='major', labelsize=8, labelrotation=75)
plt.tick_params(axis='y', which='major', labelsize=8)
plt.title(" Reported Fraud and Severity Of Incident", fontsize=15)
plt.xlabel(None)
plt.ylabel(None)
plt.show()
plt.clf()
The above plot displays bar plots of categories belonging to the feature ‘severity of incident’ stacked based on whether fraud is ‘Y’ or ‘N’. ‘Major Damage’ stands out as 60% of claims are reported as fraud whereas the other categories have claims reported as fraud under 16%.
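Per-category fraud rates like the 60% figure above can be computed with a row-normalized crosstab. A minimal sketch on a toy frame (column names mirror the real data; the ten rows are made up so that Major Damage has a 60% fraud rate):

```python
import pandas as pd

# Toy frame standing in for fraud_v8; values are illustrative only.
df = pd.DataFrame({
    "SeverityOfIncident": ["Major Damage"] * 5 + ["Minor Damage"] * 5,
    "ReportedFraud":      ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "Y"],
})

# normalize="index" divides each row by its total, giving the fraud rate
# within each severity category.
rates = pd.crosstab(df["SeverityOfIncident"], df["ReportedFraud"], normalize="index")
```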
import contextlib
import sys
with contextlib.redirect_stderr(sys.stdout):
    grouped_veh_mk=fraud_v8.groupby(['VehicleMake','ReportedFraud']).agg({'count':'sum'})
grouped_veh_mk_perc=grouped_veh_mk.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)
grouped_veh_mk_perc.rename(columns={'count':'Percent'}, inplace=True)
#Convert from multi index to single index.
grouped_veh_mk_perc_single=grouped_veh_mk_perc.reset_index(level=[1])
grouped_veh_mk_perc_single=grouped_veh_mk_perc_single.reset_index()
#Pivot wider. This makes "Y" and "N" separate columns
grouped_veh_mk_perc_wide=grouped_veh_mk_perc_single.pivot(index='VehicleMake',columns='ReportedFraud',values='Percent').reset_index()
#Reorder df following 'N'
grouped_veh_mk_ordered=grouped_veh_mk_perc_wide.sort_values(by='N')
my_range=range(1,len(grouped_veh_mk_ordered.index)+1)
plt.figure(figsize=(9, 9))
plt.hlines(y=my_range, xmin=grouped_veh_mk_ordered['N'], xmax=grouped_veh_mk_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_veh_mk_ordered['N'], my_range, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_veh_mk_ordered['Y'], my_range, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=18,fontsize=6, borderpad=0, facecolor="wheat")
plt.yticks(my_range, grouped_veh_mk_ordered['VehicleMake'])
## (y-ticks: Audi, BMW, Ford, Mercedes, Volkswagen, Dodge, Chevrolet, Suburu, Honda, Toyota, Saab, Nissan, Accura, Jeep)
plt.title("Reported Fraud by Vehicle Make", fontsize=15,loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel(None)
plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)
plt.tick_params(axis='y', which='major', labelsize=5)
plt.show()
plt.clf()
We find that Volkswagen, Mercedes, Ford, BMW, and Audi are the vehicle makes with reported fraud over 30%. This is an interesting statistic, though given the large number of categories we’ll explore the ‘Vehicle Make’ feature further.
Box plots show the median total claim is roughly the same for all makes.
Nissan, Subaru, and Toyota have a median capital gain near 20,000, substantially larger than all other makes. The makes with over 30% reported fraud all have medians of zero.
Due to its large number of categories, ‘Vehicle Make’ will be excluded from the modeling process.
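A quick cardinality check makes exclusion decisions like this systematic. A minimal sketch on a toy frame (the real VehicleMake column has 14 categories after cleaning; the threshold of 3 here is purely illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns; values are illustrative only.
df = pd.DataFrame({
    "VehicleMake": pd.Categorical(["Audi", "BMW", "Ford", "Audi", "Jeep"]),
    "ReportedFraud": pd.Categorical(["Y", "N", "N", "Y", "N"]),
})

# Unique-category count per categorical column; flag high-cardinality ones.
cardinality = df.select_dtypes("category").nunique()
to_drop = cardinality[cardinality > 3].index.tolist()
```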
grouped_inc_st=fraud_v8.groupby(['IncidentState','ReportedFraud']).agg({'count':'sum'})
grouped_inc_st_perc=grouped_inc_st.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)
grouped_inc_st_perc.rename(columns={'count':'Percent'}, inplace=True)
grouped_inc_st_perc_single=grouped_inc_st_perc.reset_index(level=[1])
grouped_inc_st_perc_single=grouped_inc_st_perc_single.reset_index()
grouped_inc_st_perc_wide=grouped_inc_st_perc_single.pivot(index='IncidentState',columns='ReportedFraud',values='Percent').reset_index()
grouped_inc_st_ordered=grouped_inc_st_perc_wide.sort_values(by='N')
my_range_2=range(1,len(grouped_inc_st_ordered.index)+1)
plt.figure(figsize=(9, 9))
plt.hlines(y=my_range_2, xmin=grouped_inc_st_ordered['N'], xmax=grouped_inc_st_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_inc_st_ordered['N'], my_range_2, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_inc_st_ordered['Y'], my_range_2, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=8,fontsize=6, borderpad=0, facecolor="wheat")
plt.yticks(my_range_2, grouped_inc_st_ordered['IncidentState'])
## (y-ticks: State3, State4, State7, State6, State8, State5, State9)
plt.title("Reported Fraud by Incident State", loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel('Incident State')
plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)
plt.tick_params(axis='y', which='major', labelsize=5)
plt.show()
plt.clf()
Incident States 4, 6, and 7 have reported fraud just over 30%, which appears significant. However, Incident State 3 stands out from the other states with reported fraud around 42%.
grouped_type_inc=fraud_v8.groupby(['TypeOfIncident','ReportedFraud']).agg({'count':'sum'})
grouped_type_inc_perc=grouped_type_inc.groupby(level=0, group_keys=False).apply(lambda x: x /x.sum()).round(2)
grouped_type_inc_perc.rename(columns={'count':'Percent'}, inplace=True)
grouped_type_inc_perc_single=grouped_type_inc_perc.reset_index(level=[1])
grouped_type_inc_perc_single=grouped_type_inc_perc_single.reset_index()
grouped_type_inc_perc_wide=grouped_type_inc_perc_single.pivot(index='TypeOfIncident',columns='ReportedFraud',values='Percent').reset_index()
grouped_type_inc_ordered=grouped_type_inc_perc_wide.sort_values(by='N')
my_range_3=range(1,len(grouped_type_inc_ordered.index)+1)
plt.figure(figsize=(9, 9))
plt.hlines(y=my_range_3, xmin=grouped_type_inc_ordered['N'], xmax=grouped_type_inc_ordered['Y'], color='grey', alpha=0.4)
plt.scatter(grouped_type_inc_ordered['N'], my_range_3, color='skyblue', alpha=1, label='N')
plt.scatter(grouped_type_inc_ordered['Y'], my_range_3, color='green', alpha=0.4 , label='Y')
plt.legend(title="Reported Fraud", loc="lower right", title_fontsize=8,fontsize=6, borderpad=0, facecolor="wheat")
plt.yticks(my_range_3, grouped_type_inc_ordered['TypeOfIncident'])
## (y-ticks: Single Vehicle Collision, Multi-vehicle Collision, Vehicle Theft, Parked Car)
plt.title("Reported Fraud by Type of Incident", loc='center')
plt.xlabel('Percent', fontsize=6)
plt.ylabel('Incident')
plt.tick_params(axis='x', which='major', labelsize=5, labelrotation=90)
plt.tick_params(axis='y', which='major', labelsize=5)
plt.show()
plt.clf()
From the above plots we observe two categories stand out with respect to reported fraud. ‘Single Vehicle Collision’ and ‘Multi-vehicle collision’ from the feature ‘Type of Incident’ have claims reported as fraud at 31% and 29% respectively. The other two categories are under 14%.
We’ll now review our data for outliers. Our goal is not necessarily to remove observations that are indicated as outliers; it is to derive insights that may help us understand reported fraud in combination with our machine learning models.
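One common way to flag candidate outliers programmatically is the z-score, using scipy (already imported for this project). A minimal sketch on toy numbers; note that with only eight values a single extreme point inflates the standard deviation, so the conventional cutoff of 3 is too strict here and 2.5 is used purely for illustration:

```python
import numpy as np
from scipy import stats

# Toy array with one extreme value; data are illustrative only.
x = np.array([10.0, 12, 11, 13, 12, 11, 14, 100])

# |z| measures distance from the mean in standard deviations.
z = np.abs(stats.zscore(x))
flagged = x[z > 2.5]
```

Visual inspection, as done below, remains useful because z-scores assume roughly symmetric data, which several of our features are not.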
fig = plt.figure(figsize=(6, 6))
fig.tight_layout(pad=1.30,h_pad=4, w_pad=3)
fig.suptitle('Numerical Feature Distributions', fontsize=11)
sns.set_style("dark")
plt.subplot(431)
plt.title('Policy Annual Premiums',fontsize=8, y=0.90)
dt_1=sns.histplot(data = numeric_data, x ='PolicyAnnualPremium')
dt_1.tick_params(axis='both', which='major', labelsize=4)
#dt_1.tick_params(axis='x',labelrotation=35)
dt_1.set(xlabel=None)
dt_1.set(ylabel=None)
plt.subplot(432)
plt.title('Umbrella Limit',fontsize=8, y=0.90)
dt_2=sns.histplot(data = numeric_data, x ='UmbrellaLimit')
dt_2.tick_params(axis='both', which='major', labelsize=6)
#dt_2.tick_params(axis='x',labelrotation=35)
dt_2.set(xlabel=None)
dt_2.set(ylabel=None)
plt.subplot(433)
plt.title('Coverage Incident Difference',fontsize=8, y=0.90)
dt_3=sns.histplot(data = numeric_data, x ='coverageIncidentDiff')
dt_3.tick_params(axis='both', which='major', labelsize=6)
#dt_3.tick_params(axis='x',labelrotation=35)
dt_3.set(xlabel=None)
dt_3.set(ylabel=None)
plt.subplot(434)
plt.title('Amount Of Total Claim',fontsize=6, y=0.80)
dt_4=sns.histplot(data = numeric_data, x = 'AmountOfTotalClaim')
dt_4.tick_params(axis='both', which='major', labelsize=6)
#dt_4.tick_params(axis='x',labelrotation=45)
dt_4.set(xlabel=None)
dt_4.set(ylabel=None)
plt.subplot(435)
plt.title('Insured Age',fontsize=8, y=0.90)
dt_5=sns.histplot(data = numeric_data, x = 'InsuredAge')
dt_5.tick_params(axis='both', which='major', labelsize=6)
#dt_5.tick_params(axis='x',labelrotation=90)
dt_5.set(xlabel=None)
dt_5.set(ylabel=None)
plt.subplot(436)
plt.title('Capital Gains',fontsize=8, y=0.90)
dt_6=sns.histplot(data = numeric_data, x = 'CapitalGains')
dt_6.tick_params(axis='both', which='major', labelsize=6)
#dt_6.tick_params(axis='x',labelrotation=90)
dt_6.set(xlabel=None)
dt_6.set(ylabel=None)
plt.subplot(437)
plt.title('Capital Loss',fontsize=8, y=0.90)
dt_7=sns.histplot(data = numeric_data, x ='CapitalLoss')
dt_7.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_7.set(xlabel=None)
dt_7.set(ylabel=None)
plt.subplot(438)
plt.title('Customer Loyalty Period',fontsize=8, y=0.90)
dt_8=sns.histplot(data = numeric_data, x ='CustomerLoyaltyPeriod')
dt_8.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_8.set(xlabel=None)
dt_8.set(ylabel=None)
plt.subplot(439)
plt.title('Policy Deductible',fontsize=8, y=0.90)
dt_9=sns.histplot(data = numeric_data, x ='Policy_Deductible')
dt_9.tick_params(axis='both', which='major', labelsize=6)
#dt_7.tick_params(axis='x',labelrotation=90)
dt_9.set(xlabel=None)
dt_9.set(ylabel=None)
plt.subplots_adjust(wspace=1.0, hspace=2.0)
plt.show()
plt.clf()
The above plots indicate that certain numerical features exhibit distributions that are not normal; thus, we will zoom in on these features.
#Square-root rule: use sqrt(number of observations) as the bin count
n_bins=np.sqrt(len(numeric_data))
#Cast to an integer
n_bins=int(n_bins)
integers_um=range(len(numeric_data["UmbrellaLimit"]))
gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Umbrella Limit', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])
#plt.title('Histogram',fontsize=7, y=1)
hg=sns.histplot(data = numeric_data, x = 'UmbrellaLimit', bins=n_bins,ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Umbrella Limit", fontsize=5)
hg.set_ylabel("Count",fontsize=5)
#plt.title('Scatter Plot',fontsize=7, y=1)
sp=sns.scatterplot(data=numeric_data, x=integers_um, y='UmbrellaLimit', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5)
sp.set_ylabel("Umbrella Limit",fontsize=4)
plt.title('Boxplot',fontsize=7, y=1)
bp=sns.boxplot(data=numeric_data, y='UmbrellaLimit', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Umbrella Limit", fontsize=5)
bp.set_ylabel(None)
plt.tight_layout()
plt.show()
plt.clf()
The right tail shows many bars with a height of nearly zero, far from the bulk of the histogram, which suggests possible outliers. In the scatter plot we see many suspicious points around 0.9, and the boxplot has points above 0.1 that may be outliers.
integers_tc=range(len(numeric_data["AmountOfTotalClaim"]))
gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])
hg=sns.histplot(data = numeric_data, x = 'AmountOfTotalClaim', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Amount of Total Claim", fontsize=5)
hg.set_ylabel("Count",fontsize=5)
sp=sns.scatterplot(data=numeric_data, x=integers_tc, y="AmountOfTotalClaim", ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5)
sp.set_ylabel("Total Claim",fontsize=4)
bp=sns.boxplot(data=numeric_data, y="AmountOfTotalClaim", ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Amount of Total Claim", fontsize=5)
bp.set_ylabel(None)
plt.tight_layout()
plt.show()
plt.clf()
## ******Total Claim Description******
## count 28695.000000
## mean 52303.964733
## std 25109.177907
## min 150.000000
## 25% 44612.500000
## 50% 58362.000000
## 75% 68975.500000
## max 114920.000000
## Name: AmountOfTotalClaim, dtype: float64
print("Ten smallest Total Claim Amounts\n",numeric_data['AmountOfTotalClaim'].nsmallest(10))
## Ten smallest Total Claim Amounts
## 17433 150
## 17427 313
## 23140 334
## 22996 489
## 4654 547
## 9308 598
## 17430 681
## 18674 725
## 23136 812
## 12639 838
## Name: AmountOfTotalClaim, dtype: int32
print("Ten Largest Total Claim Amounts\n",numeric_data['AmountOfTotalClaim'].nlargest(10))
## Ten Largest Total Claim Amounts
## 97 114920
## 2421 114141
## 14909 114113
## 18186 113997
## 27450 113771
## 6396 112817
## 2535 112560
## 7357 111870
## 25936 111771
## 17390 111708
## Name: AmountOfTotalClaim, dtype: int32
The histogram has two peaks, one near zero and a second near 60,000, which may not reflect outliers. The scatter plot and boxplot both have points near 110,000 and near zero that could be outliers. From the descriptive statistics we find the minimum value of 150 is substantially lower than the 25 percent quantile of 44,612. Likewise, the maximum value of 114,920 is substantially higher than the 75 percent quantile of 68,975. The ten lowest and ten highest values support this. Thus, it’s possible that these values are outliers.
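The quartile comparison above can be made concrete with Tukey's 1.5×IQR fences, computed from the quartiles in the printed description. Both observed extremes fall outside the fences, which is consistent with the outlier suspicion:

```python
# Tukey fences for AmountOfTotalClaim, using the quartiles printed above.
q1, q3 = 44_612.5, 68_975.5
iqr = q3 - q1            # 24,363.0
lower = q1 - 1.5 * iqr   # 8,068.0
upper = q3 + 1.5 * iqr   # 105,520.0

# The observed minimum (150) and maximum (114,920) fall outside both fences.
below = 150 < lower
above = 114_920 > upper
```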
integers_cg=range(len(numeric_data["CapitalGains"]))
gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[1, 0])
ax3=fig.add_subplot(gs[:, 1])
hg=sns.histplot(data = numeric_data, x = "CapitalGains", ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Capital Gains", fontsize=5)
hg.set_ylabel("Count",fontsize=5)
sp=sns.scatterplot(data=numeric_data, x=integers_cg, y="CapitalGains", ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=3.5)
sp.set_xlabel("Index", fontsize=5)
sp.set_ylabel("Capital Gains",fontsize=4)
bp=sns.boxplot(data=numeric_data, y="CapitalGains", ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("Capital Gains", fontsize=5)
bp.set_ylabel(None)
plt.tight_layout()
plt.show()
plt.clf()
## Capital Gains Description
## count 28695.000000
## mean 23074.225475
## std 27638.373450
## min 0.000000
## 25% 0.000000
## 50% 0.000000
## 75% 49000.000000
## max 100500.000000
## Name: CapitalGains, dtype: float64
## Ten smallest Capital Gain Amounts
## 4 0
## 5 0
## 11 0
## 12 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 0
## 23 0
## Name: CapitalGains, dtype: int32
## Ten Largest Capital Gain Amounts
## 593 100500
## 2064 100500
## 3000 100500
## 3274 100500
## 4642 100500
## 5423 100500
## 6093 100500
## 9284 100500
## 9420 100500
## 12750 100500
## Name: CapitalGains, dtype: int32
The scatter plot has points around 100,000 that may be outliers, though neither the histogram nor the boxplot indicates this. The maximum value of 100,500 is substantially higher than the 75 percent quantile of 49,000. However, 100,500 appears in all ten largest values, so it’s likely these are not outliers.
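A repeated, identical extreme value (like 100,500 occupying all ten largest capital gains) usually points to a cap or top-coded value rather than stray data errors. Counting the rows that sit exactly at the maximum is a quick check; a minimal sketch on a toy Series (values are made up):

```python
import pandas as pd

# Toy Series standing in for a capped feature; values are illustrative only.
s = pd.Series([0, 0, 49_000, 100_500, 100_500, 100_500, 35_000, 0])

# Many rows exactly at the maximum suggests a cap, not isolated errors.
n_at_max = (s == s.max()).sum()
```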
integers_cl=range(len(numeric_data["CapitalLoss"]))
gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[1, :])
sns.histplot(data = numeric_data, x = "CapitalLoss", ax=ax1)
sns.scatterplot(data=numeric_data, x=integers_cl, y="CapitalLoss", ax=ax2)
sns.boxplot(data=numeric_data, x="CapitalLoss", ax=ax3)
plt.tight_layout()
plt.show()
plt.clf()
print("Capital Loss Description\n", numeric_data["CapitalLoss"].describe())
## Capital Loss Description
## count 28695.000000
## mean -24942.289597
## std 27919.212327
## min -111100.000000
## 25% -50000.000000
## 50% 0.000000
## 75% 0.000000
## max 0.000000
## Name: CapitalLoss, dtype: float64
print("Ten smallest Capital Loss Amounts\n",numeric_data["CapitalLoss"].nsmallest(10))
## Ten smallest Capital Loss Amounts
## 583 -111100
## 584 -111100
## 1341 -111100
## 4691 -111100
## 5417 -111100
## 6658 -111100
## 11041 -111100
## 12724 -111100
## 12725 -111100
## 12726 -111100
## Name: CapitalLoss, dtype: int32
print("Ten Largest Capital Loss Amounts\n",numeric_data["CapitalLoss"].nlargest(10))
## Ten Largest Capital Loss Amounts
## 6 0
## 7 0
## 9 0
## 11 0
## 12 0
## 13 0
## 14 0
## 18 0
## 22 0
## 23 0
## Name: CapitalLoss, dtype: int32
The scatter plot has points near -100,000 that may be outliers, and the histogram likewise shows points in the tail that are possible outliers. There is a large difference between the minimum value of -111,100 and the 25 percent quantile of -50,000. Checking the ten lowest values, we see that -111,100 occupies all ten; thus it’s likely these are not outliers.
integers_pd=range(len(numeric_data["Policy_Deductible"]))
gs=plt.GridSpec(2, 2)
fig=plt.figure(figsize=(10,8))
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[1, :])
sns.histplot(data = numeric_data, x = "Policy_Deductible", ax=ax1)
sns.scatterplot(data=numeric_data, x=integers_pd, y="Policy_Deductible", ax=ax2)
sns.boxplot(data=numeric_data, x="Policy_Deductible", ax=ax3)
plt.tight_layout()
plt.show()
plt.clf()
print("Policy Deductible description\n",numeric_data["Policy_Deductible"].describe())
## Policy Deductible description
## count 28695.000000
## mean 1114.250671
## std 546.567184
## min 500.000000
## 25% 622.000000
## 50% 1000.000000
## 75% 1625.500000
## max 2000.000000
## Name: Policy_Deductible, dtype: float64
## Ten smallest Policy Deductible Amounts
## 4 500
## 5 500
## 10 500
## 14 500
## 15 500
## 21 500
## 22 500
## 48 500
## 49 500
## 73 500
## Name: Policy_Deductible, dtype: int32
## Ten largest Policy Deductible Amounts
## 8 2000
## 18 2000
## 27 2000
## 28 2000
## 29 2000
## 33 2000
## 36 2000
## 37 2000
## 39 2000
## 40 2000
## Name: Policy_Deductible, dtype: int32
We’ll check our categorical features first by viewing their distributions. We will then use boxplots to determine whether any category of a categorical feature differs from the other categories across all of our chosen numeric features. Categories that exhibit differences from the other categories across all numeric features may be outliers.
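The boxplot comparison described above has a simple tabular counterpart: comparing a numeric feature's median across categories. A minimal sketch on a toy frame (column names mirror the real data; the five rows are made up):

```python
import pandas as pd

# Toy frame standing in for fraud_v9; values are illustrative only.
df = pd.DataFrame({
    "TypeOfIncident": ["Theft", "Theft", "Collision", "Collision", "Collision"],
    "AmountOfTotalClaim": [5_000, 6_000, 60_000, 58_000, 62_000],
})

# Median of the numeric feature within each category; a category whose median
# sits far from the others across many numeric features warrants a closer look.
medians = df.groupby("TypeOfIncident")["AmountOfTotalClaim"].median()
```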
fraud_v9=fraud_v8.copy()
fraud_v9=fraud_v9.drop(['IncidentCity','AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage', 'VehicleModel',
'VehicleYOM', 'count', 'InsuredEducationLevel','InsuredOccupation', 'VehicleMake'], axis=1)
gs=plt.GridSpec(5, 3)
fig=plt.figure(figsize=(7,5))
fig.suptitle('Categorical Feature Distributions', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1,0])
ax5=fig.add_subplot(gs[1,1])
ax6=fig.add_subplot(gs[1,2])
ax7=fig.add_subplot(gs[2, 0])
ax8=fig.add_subplot(gs[2, 1])
ax9=fig.add_subplot(gs[2,2])
ax10=fig.add_subplot(gs[3,0])
ax11=fig.add_subplot(gs[3,1])
ax12=fig.add_subplot(gs[3,2])
ax13=fig.add_subplot(gs[4,0])
#plt.title('Type of Incident',fontsize=7, y=1)
ct1=sns.countplot(data = fraud_v9, y='TypeOfIncident', orient='h', ax=ax1)
ct1.tick_params(axis='both', which='major', labelsize=4)
ct1.set_xlabel(None)
ct1.set_ylabel("Type Of Incident", fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
ct2=sns.countplot(data = fraud_v9, y='TypeOfCollission', orient='h', ax=ax2)
ct2.tick_params(axis='both', which='major', labelsize=5)
ct2.set_ylabel("Type Of Collision", fontsize=5)
ct2.set_xlabel(None)
#plt.title('Reported Fraud',fontsize=7, y=1)
ct3=sns.countplot(data=fraud_v9, y='SeverityOfIncident', orient='h', ax=ax3)
ct3.tick_params(axis='both', which='major', labelsize=5)
ct3.set_ylabel("Severity Of Incident", fontsize=5)
ct3.set_xlabel(None)
ct4=sns.countplot(data=fraud_v9, y='AuthoritiesContacted', orient='h', ax=ax4)
ct4.tick_params(axis='both', which='major', labelsize=5)
ct4.set_ylabel("Authorities Contacted", fontsize=5)
ct4.set_xlabel(None)
ct5=sns.countplot(data=fraud_v9, y='IncidentState', orient='h', ax=ax5)
ct5.tick_params(axis='both', which='major', labelsize=5)
ct5.set_ylabel('Incident State',fontsize=5)
ct5.set_xlabel(None)
ct6=sns.countplot(data=fraud_v9,y='NumberOfVehicles', orient='h', ax=ax6)
ct6.tick_params(axis='both', which='major', labelsize=5)
ct6.set_ylabel("Number Of Vehicles ", fontsize=5)
ct6.set_xlabel(None)
ct7=sns.countplot(data = fraud_v9, y='BodilyInjuries', orient='h', ax=ax7)
ct7.tick_params(axis='both', which='major', labelsize=4)
ct7.set_xlabel(None)
ct7.set_ylabel("Bodily Injuries", fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
ct8=sns.countplot(data = fraud_v9, y='Witnesses', orient='h', ax=ax8)
ct8.tick_params(axis='both', which='major', labelsize=5)
ct8.set_ylabel("Witnesses", fontsize=5)
ct8.set_xlabel(None)
#plt.title('Reported Fraud',fontsize=7, y=1)
ct9=sns.countplot(data=fraud_v9, y='InsurancePolicyState', orient='h', ax=ax9)
ct9.tick_params(axis='both', which='major', labelsize=5)
ct9.set_ylabel("Insurance Policy State", fontsize=5)
ct9.set_xlabel(None)
ct10=sns.countplot(data=fraud_v9, y='Policy_CombinedSingleLimit', orient='h', ax=ax10)
ct10.tick_params(axis='both', which='major', labelsize=5)
ct10.set_ylabel("Policy Combined/Single Limit", fontsize=4)
ct10.set_xlabel(None)
ct11=sns.countplot(data=fraud_v9, y='InsuredRelationship', orient='h', ax=ax11)
ct11.tick_params(axis='both', which='major', labelsize=5)
ct11.set_ylabel('Insured Relationship ',fontsize=5)
ct11.set_xlabel(None)
ct12=sns.countplot(data=fraud_v9,y='dayOfWeek', orient='h', ax=ax12)
ct12.tick_params(axis='both', which='major', labelsize=5)
ct12.set_ylabel("Day Of Week", fontsize=5)
ct12.set_xlabel(None)
ct13=sns.countplot(data=fraud_v9,y='IncidentPeriodDay', orient='h', ax=ax13)
ct13.tick_params(axis='both', which='major', labelsize=5)
ct13.set_ylabel("Incident Period Day", fontsize=5)
ct13.set_xlabel(None)
plt.tight_layout()
plt.show()
plt.clf()
gs=plt.GridSpec(2, 3)
fig=plt.figure(figsize=(8,6))
fig.suptitle('Type Of Incident', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1,0])
ax5=fig.add_subplot(gs[1,1])
ax6=fig.add_subplot(gs[1,2])
#plt.title('Type of Incident',fontsize=7, y=1)
bx1=sns.boxplot(data=fraud_v9, x='AmountOfTotalClaim', y='TypeOfIncident', orient='h', ax=ax1)
bx1.tick_params(axis='both', which='major', labelsize=4)
bx1.set_xlabel("Total Claim", fontsize=5)
bx1.set_ylabel(None)
#plt.title('Type of Collision',fontsize=7, y=1)
bx2=sns.boxplot(data=fraud_v9, x='InsuredAge', y='TypeOfIncident', orient='h', ax=ax2)
bx2.tick_params(axis='both', which='major', labelsize=5)
bx2.set_xlabel("Age", fontsize=5)
bx2.set_ylabel(None)
#plt.title('Reported Fraud',fontsize=7, y=1)
bx3=sns.boxplot(data=fraud_v9, x='CustomerLoyaltyPeriod', y='TypeOfIncident', orient='h', ax=ax3)
bx3.tick_params(axis='both', which='major', labelsize=5)
bx3.set_xlabel("Loyalty Period", fontsize=5)
bx3.set_ylabel(None)
bx4=sns.boxplot(data=fraud_v9, x='Policy_Deductible', y='TypeOfIncident', orient='h', ax=ax4)
bx4.tick_params(axis='both', which='major', labelsize=5)
bx4.set_xlabel("Deductible", fontsize=5)
bx4.set_ylabel(None)
bx5=sns.boxplot(data=fraud_v9, x='PolicyAnnualPremium', y='TypeOfIncident', orient='h', ax=ax5)
bx5.tick_params(axis='both', which='major', labelsize=5)
bx5.set_xlabel("Annual Premium", fontsize=5)
bx5.set_ylabel(None)
bx6=sns.boxplot(data=fraud_v9, x='UmbrellaLimit', y='TypeOfIncident', orient='h', ax=ax6)
bx6.tick_params(axis='both', which='major', labelsize=5)
bx6.set_xlabel("Umbrella Limit", fontsize=5)
bx6.set_ylabel(None)
plt.tight_layout()
plt.show()
plt.clf()
In the Type of Incident boxplots we find that the categories Parked Car and Vehicle Theft differ from the other categories for Total Claim, though this is not consistent across our other numeric features, so we cannot draw any conclusions.
The category Trivial Damage for Severity of Incident differs from the other categories for Total Claim, though again this is not consistent across the other numeric features.
No categorical feature had a category that differed from the other categories across all numeric features.
Convert target variable to binary (Y -> 1, N -> 0)
out_mod1['ReportedFraud'] = out_mod1['ReportedFraud'].map({'Y': 1, 'N': 0})
out_mod2=out_mod1.copy()
out_mod2=out_mod2.drop('ReportedFraud', axis=1)
Identify categorical and numerical columns
categorical_cols = out_mod2.select_dtypes(include=['category']).columns.tolist()
numerical_cols = out_mod2.select_dtypes(include=['int32', 'float64']).columns.tolist()
One-Hot Encoding for categorical variables
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_cats = encoder.fit_transform(out_mod2[categorical_cols])
encoded_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(categorical_cols))
Standardize numerical features
scaler = StandardScaler()
scaled_nums = scaler.fit_transform(out_mod2[numerical_cols])
scaled_df = pd.DataFrame(scaled_nums, columns=numerical_cols)
Combine processed numerical and categorical data
processed_df = pd.concat([scaled_df, encoded_df,out_mod1['ReportedFraud']], axis=1)
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
feature_columns = [col for col in processed_df.columns if col != 'ReportedFraud']
import contextlib, sys
with contextlib.redirect_stderr(sys.stdout):
    iso_forest.fit(processed_df[feature_columns])
IsolationForest(contamination=0.05, random_state=42)
processed_df_2 = processed_df.copy()  # working copy to hold the anomaly results
with contextlib.redirect_stderr(sys.stdout):
    # Get anomaly scores and predictions
    processed_df_2['Anomaly_Score'] = iso_forest.decision_function(processed_df_2[feature_columns])
    processed_df_2['Anomaly_Label'] = iso_forest.predict(processed_df_2[feature_columns])
    processed_df_2['Anomaly_Label'] = processed_df_2['Anomaly_Label'].apply(lambda x: 1 if x == -1 else 0)
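The score and label conventions used above can be sketched on synthetic data (the values here are illustrative, not from the claims dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 200 tightly clustered "normal" points plus 5 obvious outliers
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outliers = rng.uniform(low=4.0, high=6.0, size=(5, 2))
X = np.vstack([inliers, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iso.fit(X)

scores = iso.decision_function(X)  # lower (more negative) = more anomalous
labels = iso.predict(X)            # -1 = anomaly, 1 = normal

# Mirror the 0/1 relabelling used above: 1 = anomaly, 0 = normal
anomaly_flags = np.where(labels == -1, 1, 0)
print(anomaly_flags[-5:])  # the five planted outliers are flagged
```

The planted outliers receive negative decision-function scores and -1 labels, matching the interpretation of the histograms below.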
plt.figure(figsize=(10, 5))
plt.hist(processed_df_2['Anomaly_Score'], bins=50, alpha=0.7, color='blue', edgecolor='black')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.title('Distribution of Anomaly Scores')
plt.show()
plt.clf()
The left tail of the distribution shows that the more anomalous observations have negative scores, roughly between -0.04 and -0.06.
plt.figure(figsize=(10, 5))
plt.scatter(processed_df_2.index, processed_df_2['Anomaly_Score'], c=processed_df_2['Anomaly_Label'], cmap='coolwarm', alpha=0.6)
plt.xlabel('Observation Index')
plt.ylabel('Anomaly Score')
plt.title('Anomaly Score vs. Observations (Red = Anomalies)')
plt.colorbar(label="Anomaly Label (1 = Anomaly, 0 = Normal)")
## <matplotlib.colorbar.Colorbar object at 0x35c0d87f0>
plt.show()
plt.clf()
Observations with label 1 (anomaly) have anomaly scores below zero, whereas observations with label 0 (normal) have scores above zero.
with contextlib.redirect_stderr(sys.stdout):
    # Compare fraud cases vs anomaly detection
    fraud_anomalies = processed_df_2.groupby(['ReportedFraud', 'Anomaly_Label']).size().unstack()
fraud_anomalies.plot(kind='bar', stacked=True, figsize=(8, 5))
plt.xlabel('Reported Fraud (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.title('Comparison of Reported Fraud vs Anomaly Detection')
plt.legend(title="Anomaly Label", labels=['Normal', 'Anomaly'])
plt.show()
plt.clf()
from sklearn.tree import DecisionTreeClassifier
# Feature importance using Decision Tree as a surrogate model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(processed_df_2[feature_columns], processed_df_2['Anomaly_Label'])
DecisionTreeClassifier(random_state=42)
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': tree_model.feature_importances_
}).sort_values(by='Importance', ascending=False).round(3)
feat_out_tp5=feature_importance.nlargest(5,"Importance")
values = feat_out_tp5.Importance
idx = feat_out_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['red' if (x == max(values)) else 'green' for x in values]  # highlight the most important feature
sns.barplot(y=idx, x=values, palette=clrs).set(title='Important Features - Decision Tree Anomaly Surrogate Model')
plt.ylabel("Features", fontsize=10)
plt.tick_params(axis='x', which='major', labelsize=9)
plt.tick_params(axis='y', labelsize=7,labelrotation=45)
plt.show()
plt.clf()
Feature Importance is a score assigned to features that defines how “important” a feature is to the model’s prediction. This means the extent to which the feature contributes to the final output. However, feature importance does not inform us if the contribution is a positive or negative impact on the final output.
In our inspection of these top features in the visualization section, the distributions for fraud and no fraud did not show differences. To perform anomaly detection thoroughly, we would include other model algorithms. Since the focus of this project is predicting fraud, anomaly detection will be a separate project.
Before model building can start, we’ll need to perform pre-processing. This entails splitting our data into training and test sets, along with transforming numerical and categorical features into classification-friendly formats.
Select features
## The Target categories: Index(['N', 'Y'], dtype='object'):
We will relocate the ReportedFraud feature to the last column of model_data.
col=model_data.pop('ReportedFraud')
model_data.insert(22,'ReportedFraud', col)
We will separate the data to get predictor features and target features into separate data frames.
model_data2=model_data2.rename(columns={'ReportedFraud': 'labels'})
The data type of the target feature is categorical. Most machine learning algorithms require numerical data types, so the target feature y will be transformed to a numeric type.
label_encoder=LabelEncoder()
def split_data(data):
    y=data.iloc[:, -1]
    y=pd.DataFrame(y)
    y['labels']=label_encoder.fit_transform(y['labels'])
    y['labels']=y['labels'].astype("category")
    X=data.iloc[:, :-1]
    return X, y
X,y=split_data(model_data2)
## CategoricalDtype(categories=[0, 1], ordered=False)
## Target Feature categories as binary: Int64Index([0, 1], dtype='int64'):
## Shape of Predictor Features is (28181, 22):
## Shape of Target Feature is (28181, 1):
The X data frame has 28181 rows of predictor features. The y data frame has the same number of rows and a single column, the target feature.
## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 23 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 TypeOfIncident 28181 non-null category
## 1 TypeOfCollission 28181 non-null category
## 2 SeverityOfIncident 28181 non-null category
## 3 AuthoritiesContacted 28181 non-null category
## 4 IncidentState 28181 non-null category
## 5 NumberOfVehicles 28181 non-null category
## 6 BodilyInjuries 28181 non-null category
## 7 Witnesses 28181 non-null category
## 8 AmountOfTotalClaim 28181 non-null int32
## 9 InsuredAge 28181 non-null int32
## 10 InsuredGender 28181 non-null category
## 11 CapitalGains 28181 non-null int32
## 12 CapitalLoss 28181 non-null int32
## 13 CustomerLoyaltyPeriod 28181 non-null int32
## 14 InsurancePolicyState 28181 non-null category
## 15 Policy_CombinedSingleLimit 28181 non-null category
## 16 Policy_Deductible 28181 non-null int32
## 17 PolicyAnnualPremium 28181 non-null float64
## 18 UmbrellaLimit 28181 non-null int32
## 19 InsuredRelationship 28181 non-null category
## 20 coverageIncidentDiff 28181 non-null float64
## 21 dayOfWeek 28181 non-null category
## 22 IncidentPeriodDay 28181 non-null category
## dtypes: category(14), float64(2), int32(7)
## memory usage: 1.8 MB
We will now split X,y into Train and Test sets
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.25,random_state=42, stratify=y)
## Shape of X Train: (21135, 23)
## Shape of X Test: (7046, 23)
## Shape of y Train: (21135, 1)
## Shape of y Test: (7046, 1)
y train and y test will be transformed into one-dimensional arrays using a function.
def transform_to_array(y_train, y_test):
    #transform from data frame to numpy array
    y_train_array=np.array(y_train)
    y_test_array=np.array(y_test)
    #transform to one dimensional array
    y_train_np=np.ravel(y_train_array)
    y_test_np=np.ravel(y_test_array)
    return y_train_np, y_test_np
y_train_np, y_test_np=transform_to_array(y_train, y_test)
## Shape of y Train np: (21135,)
## Shape of y Test np: (7046,)
From the above output we see that y train and y test have been transformed into one dimensional numpy arrays.
Our next step is to transform the predictor features into acceptable machine learning formats.
Transformation for numerical features is performed by scaling. Scaling prevents a feature with a range let’s say in the thousands from being considered more important than a feature having a lower range. Scaling places features at the same importance before being applied to a machine learning algorithm. There are different methods used in scaling features, for this analysis we’ll be using standard scaling. Standard scaling transforms the data to have zero mean and a variance of one, thus making the data unitless.
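A minimal sketch of the zero-mean, unit-variance property on toy numbers (a deductible-like column in the thousands, not the claims data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values in the thousands, e.g. a deductible-like column
X = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])
z = StandardScaler().fit_transform(X)

print(z.ravel())   # unitless z-scores: (x - mean) / std
print(z.mean())    # approximately 0
print(z.std())     # approximately 1
```

After scaling, a column that ranged over thousands contributes on the same footing as one that ranged over single digits.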
Most machine learning algorithms only accept numerical features which makes categorical features unacceptable in their original form. Thus, we need to encode categorical features into numerical values. The act of replacing categories with numbers is called categorical encoding. For this we will use one-hot encoding. Categorical features are represented as a group of binary features, where each binary feature represents one category. The binary feature takes the integer value 1 if the category is present, or 0 otherwise.
set_config configures pre-processing steps such as StandardScaler and OneHotEncoder to return a pandas DataFrame.
set_config(transform_output="pandas")
def define_columns(X_train):
    categorical= list(X_train.select_dtypes('category').columns)
    numerical = list(X_train.select_dtypes('number').columns)
    return categorical, numerical
categorical, numerical=define_columns(X_train)
First, we will create a function which will transform train and test sets for the Logistic Regression model. This entails dropping the first category of each feature during One Hot Encoding.
def transform_x_columns(X_train, X_test):
    ct_lr=ColumnTransformer(
        transformers=[
            ('scale',StandardScaler(), numerical),
            ('ohe',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'), categorical)])
    X_train_lr=ct_lr.fit_transform(X_train)
    X_test_lr=ct_lr.transform(X_test)
    return X_train_lr, X_test_lr
with contextlib.redirect_stderr(sys.stdout):
    X_train_lr, X_test_lr=transform_x_columns(X_train, X_test)
## ********************First Five Rows X_train_lr********************
## scale__AmountOfTotalClaim scale__InsuredAge scale__CapitalGains
## 3409 -1.852393 -1.614806 1.582364
## 19726 -0.207519 -0.234876 1.636523
## 9912 0.171842 1.395950 1.174367
## 5686 -1.839096 0.266916 1.156314
## 11372 -1.850651 1.395950 1.022723
## ********************First Five Rows X_test_lr********************
## scale__AmountOfTotalClaim scale__InsuredAge scale__CapitalGains
## 7225 -0.329962 -1.238462 1.224916
## 15229 -0.220341 -0.611221 1.062439
## 24504 0.747730 -0.987565 1.037165
## 14811 1.059300 -0.485773 -0.836730
## 14954 0.421042 -0.987565 1.701513
We will confirm that the columns of X_train_lr and X_test_lr are the same count after transformation
# Function to check column count
def check_columns_equal(df1, df2):
    assert df1.shape[1] == df2.shape[1], f"Error: Column counts do not match. df1 has {df1.shape[1]} columns, df2 has {df2.shape[1]} columns."
    print("Columns are equal.")
check_columns_equal(X_train_lr, X_test_lr)
## Columns are equal.
We see from the first five rows of the train and test sets that the features have been transformed while retaining the feature column names.
## Shape of X Train lr: (21135, 63)
## Shape of X Test lr: (7046, 63)
Next, we transform the training and test sets for all other models. During One Hot Encoding, the first category will be dropped only if the feature is binary.
def transform_x_columns_tr(train, test):
    ct_tr=ColumnTransformer(
        transformers=[
            ('num',StandardScaler(), numerical),
            ('cat',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'), categorical)])
    train_tr=ct_tr.fit_transform(train)
    test_tr=ct_tr.transform(test)
    return train_tr, test_tr
X_train_tr, X_test_tr=transform_x_columns_tr(X_train, X_test)
## ************First Five Rows X_train_tr************
## num__AmountOfTotalClaim num__InsuredAge num__CapitalGains
## 3409 -1.852393 -1.614806 1.582364
## 19726 -0.207519 -0.234876 1.636523
## 9912 0.171842 1.395950 1.174367
## 5686 -1.839096 0.266916 1.156314
## 11372 -1.850651 1.395950 1.022723
## ********************First Five Rows X_test_tr********************
## num__AmountOfTotalClaim num__InsuredAge num__CapitalGains
## 7225 -0.329962 -1.238462 1.224916
## 15229 -0.220341 -0.611221 1.062439
## 24504 0.747730 -0.987565 1.037165
## 14811 1.059300 -0.485773 -0.836730
## 14954 0.421042 -0.987565 1.701513
We will confirm that the X_train_tr and X_test_tr columns count are equal
check_columns_equal(X_train_tr, X_test_tr)
## Columns are equal.
## Shape of X Train tr: (21135, 76)
## Shape of X Test tr: (7046, 76)
From the shape output we find there are 13 additional columns compared to the logistic regression transformed data.
For evaluating model performance, the event of interest is whether reported fraud is yes. This is considered the positive class. Classification metrics are used to determine how well our models predict this event.
Accuracy-measures the number of correct predictions as a percentage of the total number of predictions made. As an example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: number of correct predictions/number of total predictions = (TP+TN)/(TP+TN+FP+FN)
Precision-tells us about the quality of positive predictions. The model may not find all the positives, but the ones it does classify as positive are very likely to be correct. As an example, out of everyone predicted to have defaulted, how many of them did default? So, within everything that has been predicted as positive, precision counts the percentage that is correct. Calculation: True Positives/All Predicted Positives = TP/(TP+FP)
Recall-tells us how well the model identifies true positives. The model may find a lot of positives, yet it may also wrongly flag many cases that are not actually positive. Out of all the patients who have the disease, how many were correctly identified? So, of everything that is actually positive, recall counts how many the model successfully found. A model with low recall is not able to find all (or a large part) of the positive cases in the data. Calculation: True Positives/(True Positives + False Negatives) = TP/(TP+FN)
F1 Score-The F1 score is defined as the harmonic mean of precision and recall.
The harmonic mean is an alternative metric for the more common arithmetic mean. It is often useful when computing an average rate. https://en.wikipedia.org/wiki/Harmonic_mean
The formula for the F1 score is: 2 × (Precision × Recall) / (Precision + Recall)
Because the F1 score is the harmonic mean of Precision and Recall, it gives equal weight to Precision and Recall:
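The four metrics can be checked by hand on a toy set of predictions (values chosen for easy arithmetic, with fraud = 1 as the positive class):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 actual frauds, 6 non-frauds
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # TP=3, FN=1, FP=1, TN=5

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 8/10 = 0.8
print(precision_score(y_true, y_pred))  # TP/(TP+FP)   = 3/4  = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN)   = 3/4  = 0.75
print(f1_score(y_true, y_pred))         # 2*(0.75*0.75)/(0.75+0.75) = 0.75
```

With precision and recall equal, their harmonic mean (F1) equals both; when they diverge, F1 is pulled toward the smaller of the two.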
recall_scorer = make_scorer(recall_score, pos_label=1)
precision_scorer = make_scorer(precision_score, pos_label=1)
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)  # score on decision values rather than hard class predictions
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg=LogisticRegression(random_state=1)
lr_base_clf=logreg.fit(X_train_lr, y_train_np)
start_time = time.time()
lr_base_cv_accuracy=cross_val_score(lr_base_clf, X_train_lr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
log_CrossValAccurBase_time = time.time() - start_time
start_time = time.time()
lr_base_cv_recall_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np,
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValRecallBase_time = time.time() - start_time
start_time = time.time()
lr_base_cv_precision_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValPrecBase_time = time.time() - start_time
start_time = time.time()
lr_base_cv_auc_score=cross_val_score(lr_base_clf, X_train_lr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAUCBase_time = time.time() - start_time
start_time = time.time()
lr_base_cv_f1=cross_val_score(lr_base_clf, X_train_lr, y_train_np, cv=skf, scoring='f1').mean().round(2)
log_CrossValF1Base_time = time.time() - start_time
lr=LogisticRegression(random_state=1)
lr_params={
'C': [0.0001,0.001, 0.01, 0.1, 1, 10],
'penalty': ['l2'],
'max_iter': list(range(5000,40000, 5000)),
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
lr_search=RandomizedSearchCV(lr, lr_params, refit=True,
verbose=3,cv=5,n_iter=6,scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
lr_search.fit(X_train_lr, y_train_np)
RandomizedSearchCV(cv=5, estimator=LogisticRegression(random_state=1), n_iter=6,
                   n_jobs=-1,
                   param_distributions={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                                        'max_iter': [5000, 10000, 15000, 20000,
                                                     25000, 30000, 35000],
                                        'penalty': ['l2'],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   return_train_score=True, scoring='roc_auc', verbose=3)
log_grid_training_time = time.time() - start_time
lr_cv_results=pd.DataFrame(lr_search.cv_results_)
lr_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score 0.769316
## std_train_score 0.001226
## mean_test_score 0.765394
## std_test_score 0.004892
## dtype: float64
Mean train and test scores from cross validation indicate no over-fitting or under-fitting.
lr_clf=lr_search.best_estimator_
LogisticRegression(C=1, max_iter=30000, random_state=1, solver='newton-cg')
The above display gives us the parameters chosen for the logistic regression model.
start_time = time.time()
lr_cv_accuracy=cross_val_score(lr_clf, X_train_lr, y_train_np,
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAccur_time = time.time() - start_time
start_time = time.time()
lr_cv_f1_score=cross_val_score(lr_clf, X_train_lr, y_train_np,
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
log_CrossValF1_time = time.time() - start_time
start_time = time.time()
lr_cv_recall_score=cross_val_score(lr_clf, X_train_lr, y_train_np,
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValRecall_time = time.time() - start_time
start_time = time.time()
lr_cv_precision_score=cross_val_score(lr_clf, X_train_lr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValPrec_time = time.time() - start_time
start_time = time.time()
lr_cv_auc_score=cross_val_score(lr_clf, X_train_lr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
log_CrossValAuc_time = time.time() - start_time
log_cross_val_Time=(log_CrossValAccur_time+log_CrossValRecall_time+log_CrossValF1_time+ log_CrossValPrec_time+log_CrossValAuc_time)/5
from sklearn import metrics
y_pred_lr=lr_clf.predict(X_test_lr)
cm_lr = metrics.confusion_matrix(y_test_np, y_pred_lr, labels=[0,1])
df_cm_lr = pd.DataFrame(cm_lr, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_lr.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_lr.flatten()/np.sum(cm_lr)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_lr, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Logistic Regression", fontsize=14)
plt.show()
plt.clf()
The confusion matrix plot displays the performance of a classifier. Accurate fraud predictions of Yes (true positives) are located at the bottom-right of the matrix. Inaccurate fraud predictions of Yes (false positives) are located at the top-right. Accurate fraud predictions of No (true negatives) are located at the top-left. Inaccurate fraud predictions of No (false negatives) are located at the bottom-left.
We see from the confusion matrix that 13.84% of observations were accurately predicted as fraud, compared to 7.41% that were inaccurately predicted as fraud.
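sklearn's confusion_matrix follows this layout; a toy check (labels=[0, 1] puts the No class first, as in our plots):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class, with labels=[0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # [[2 1] [1 2]]: TN=2, FP=1, FN=1, TP=2
```

Reading the heatmap above against this layout, the bottom-right cell holds the true positives and the top-right cell the false positives.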
We will now look at Feature Importance. Feature Importance is a score assigned to the features of a Machine Learning model that defines how “important” a feature is to the model’s prediction. This means the extent to which the feature contributes to the final output. However, feature importance does not inform us if the contribution is a positive or negative impact on the final output.
feature_importance_lr=pd.DataFrame({'feature':list(X_test_lr.columns),'feature_importance':[abs(i) for i in lr_clf.coef_[0]]})
feature_importance_lr=feature_importance_lr.sort_values('feature_importance',ascending=False)
For the logistic regression model we took the absolute value of the coefficients to capture the importance of features with both negative and positive effects.
Now that we have the importance of the features, we will now transform the coefficients for easier interpretation. The coefficients are in log odds format. We will transform them to odds-ratio format.
#Combine feature names and coefficients into a Pandas DataFrame
feature_names_lr=pd.DataFrame(X_test_lr.columns, columns=['Feature'])
log_coef=pd.DataFrame(np.transpose(lr_clf.coef_), columns=['Coefficient'])
coefficients=pd.concat([feature_names_lr, log_coef], axis=1)
#Calculate the exponent of the logistic regression coefficients (odds ratios)
coefficients['Exp_Coefficient']=np.exp(coefficients['Coefficient'])
#Keep coefficients with an odds ratio of at least one
coefficients=coefficients[coefficients['Exp_Coefficient']>=1]
coefficients_tp5=coefficients.nlargest(5,"Exp_Coefficient")
## ******************Top Five Coefficients******************
## Feature Exp_Coefficient
## 50 ohe__InsuredRelationship_unmarried 1.608058
## 34 ohe__Witnesses_2 1.557392
## 48 ohe__InsuredRelationship_other-relative 1.528998
## 47 ohe__InsuredRelationship_not-in-family 1.487331
## 58 ohe__IncidentPeriodDay_early morning 1.410756
Three levels of Insured Relationship are in the top five, along with Witnesses = 2 and Incident Period Day = early morning.
Support Vector Machine (the “road machine”) is responsible for finding the decision boundary that separates different classes while maximizing the margin. A decision boundary differentiates two classes: a data point falling on either side of the decision boundary is assigned to a different class. For binary classes this would be either yes or no.
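The decision-boundary idea can be sketched on a hand-made, linearly separable toy problem (points chosen for illustration, not from the claims data):

```python
import numpy as np
from sklearn.svm import SVC

# Two classes sitting on either side of the line x = 0
X = np.array([[-2.0, 0.0], [-1.5, 1.0], [-1.0, -1.0],
              [1.0, 0.0], [1.5, 1.0], [2.0, -1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, y)

# decision_function gives a signed distance to the boundary;
# its sign determines which side (class) a point falls on
print(clf.predict(np.array([[-3.0, 0.0], [3.0, 0.0]])))
print(clf.support_vectors_)  # the points that pin down the maximal margin
```

Only the support vectors (the points closest to the boundary) determine where the separating hyperplane sits; the remaining points could move without changing the fit.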
from sklearn.svm import SVC
svc=SVC(random_state=1, kernel="rbf")
svc_base_clf=svc.fit(X_train_lr, y_train_np)
start_time = time.time()
svc_base_cv_accuracy=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
svc_CrossValAccurBase_time = time.time() - start_time
start_time = time.time()
svc_base_cv_recall=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring=recall_scorer, n_jobs=-1).mean().round(2)
svcBase_CrossValRcall = time.time() - start_time
start_time = time.time()
svc_base_cv_precision=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring=precision_scorer, n_jobs=-1).mean().round(2)
svcBase_CrossValprec = time.time() - start_time
start_time = time.time()
svc_base_cv_f1=cross_val_score(svc_base_clf, X_train_lr, y_train_np, cv=skf, scoring='f1', n_jobs=-1).mean().round(2)
svcBase_CrossValF1 = time.time() - start_time
start_time = time.time()
svc_base_cv_auc_score=cross_val_score(svc_base_clf, X_train_lr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
svcBase_CrossValAuc = time.time() - start_time
svcBase_cross_val_Time=(svc_CrossValAccurBase_time+svcBase_CrossValRcall+svcBase_CrossValF1+ svcBase_CrossValprec+svcBase_CrossValAuc)/5
param_grid_svc = {'C': [0.0001,.001,.01,1, 10, 100], 'gamma': [1,0.1,0.01,0.001, .0001]}
grid_svc=RandomizedSearchCV(svc,param_grid_svc, refit=True,
verbose=3,cv=5,n_iter=6, scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
grid_svc.fit(X_train_lr, y_train_np)
RandomizedSearchCV(cv=5, estimator=SVC(random_state=1), n_iter=6, n_jobs=-1,
                   param_distributions={'C': [0.0001, 0.001, 0.01, 1, 10, 100],
                                        'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
                   return_train_score=True, scoring='roc_auc', verbose=3)
svc_grid_training_time = time.time() - start_time
svc_cv_results=pd.DataFrame(grid_svc.cv_results_)
svc_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score 0.866267
## std_train_score 0.002293
## mean_test_score 0.824598
## std_test_score 0.004312
## dtype: float64
The cross validation score results show the mean train score is about 0.04 higher than the mean test score, which may indicate slight overfitting. Applying the classifier to the test data will help clarify this.
svc_clf=grid_svc.best_estimator_
SVC(C=10, gamma=0.1, random_state=1)
The above display gives us the parameters chosen for the support vector machine model.
start_time = time.time()
svc_cv_accuracy=cross_val_score(svc_clf, X_train_lr, y_train_np,
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValAccur_time = time.time() - start_time
start_time = time.time()
svc_cv_f1_score=cross_val_score(svc_clf, X_train_lr, y_train_np,
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValF1_time = time.time() - start_time
start_time = time.time()
svc_cv_recall_score=cross_val_score(svc_clf, X_train_lr, y_train_np,
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValRecall_time = time.time() - start_time
start_time = time.time()
svc_cv_precision=cross_val_score(svc_clf, X_train_lr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValPrec_time = time.time() - start_time
start_time = time.time()
svc_cv_auc=cross_val_score(svc_clf, X_train_lr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
svc_CrossValAuc_time = time.time() - start_time
y_predBase_svc=svc_base_clf.predict(X_test_lr)
print('******SVC Classification Report******')
## ******SVC Classification Report******
print(classification_report(y_test_np, y_predBase_svc))
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 5137
## 1 0.90 0.74 0.81 1909
##
## accuracy 0.91 7046
## macro avg 0.90 0.85 0.87 7046
## weighted avg 0.91 0.91 0.90 7046
svc_AccuracyBase_test=roc_auc_score(y_test_np, y_predBase_svc).round(2)
cm_svc = metrics.confusion_matrix(y_test_np, y_predBase_svc, labels=[0,1])
df_cm_svc = pd.DataFrame(cm_svc, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_svc.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_svc.flatten()/np.sum(cm_svc)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_svc, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Support Vector Machine", fontsize=14)
plt.show()
plt.clf()
Our SVC classifier performed better at accurately predicting fraud of
yes (20%) than the logistic regression classifier (13.84%). Additionally,
the SVC classifier's inaccurate fraud predictions of yes were only 2.28%,
compared to the logistic regression model's 7.41%.
Random forest is an ensemble learning method. Ensemble learning takes predictions from multiple models and merges them to enhance the accuracy of prediction. There are four types of ensemble techniques. We'll be using bagging (of which random forest is an example) and boosting, which our next models will demonstrate.
Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions.
Random Forest models are made up of individual decision trees whose predictions are combined for a final result. The final result is decided using majority rules which means that the final prediction is what the majority of the decision tree models chose. An example would be 5 models in which 3 of the 5 models predict ‘yes’ for the classification problem.
Random Forests can be made up of thousands of decision trees.
Simply put, the random forest builds multiple decision trees and merges them together to get a more accurate prediction.
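The majority-rules idea can be sketched in a few lines; the vote matrix below is toy data, not output from the project's forest:

```python
import numpy as np

# Each row is one decision tree's prediction for the same three claims
# (1 = fraud, 0 = not fraud); toy values for illustration
votes = np.array([
    [1, 0, 1],   # tree 1
    [1, 0, 0],   # tree 2
    [0, 1, 1],   # tree 3
    [1, 0, 1],   # tree 4
    [1, 1, 1],   # tree 5
])

# The forest's final prediction is the class chosen by most trees:
# a claim is labeled fraud when at least half the trees vote 1
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(majority)  # -> [1 0 1]
```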
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(random_state=1,n_jobs=-1)
rf_base_clf=rf.fit(X_train_tr, y_train_np)
start_time = time.time()
rf_base_cv_accuracy=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
rfBase_CrossValAccur = time.time() - start_time
start_time = time.time()
rf_base_cv_recall=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='recall').mean().round(2)
rfBase_CrossValRecall = time.time() - start_time
start_time = time.time()
rf_base_cv_precision=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='precision').mean().round(2)
rfBase_CrossValPrec = time.time() - start_time
start_time = time.time()
rf_base_cv_f1=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
rfBase_CrossValF1 = time.time() - start_time
start_time = time.time()
rf_base_cv_auc=cross_val_score(rf_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
rfBase_CrossValAuc = time.time() - start_time
rf_params={'n_estimators':[500,1000,22500,5000],
'max_features':[0.25,0.50,0.75,1.0],
'min_samples_split':[2,4,6,8],
#'max_depth': [500, 1000, 2000, 4000,6000],
'max_depth': list(range(500,15000,500)),
'min_samples_leaf': [3, 4, 5, 6],
'criterion': ['gini', 'entropy', 'log_loss']}
rf_search=RandomizedSearchCV(rf, rf_params, n_iter=6,refit=True,
verbose=3,cv=5, scoring='roc_auc',return_train_score=True, n_jobs=-1)
start_time = time.time()
rf_search.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5,
estimator=RandomForestClassifier(n_jobs=-1, random_state=1),
n_iter=6, n_jobs=-1,
param_distributions={'criterion': ['gini', 'entropy',
'log_loss'],
'max_depth': [500, 1000, 1500, 2000,
2500, 3000, 3500, 4000,
4500, 5000, 5500, 6000,
6500, 7000, 7500, 8000,
8500, 9000, 9500, 10000,
10500, 11000, 11500,
12000, 12500, 13000,
13500, 14000, 14500],
'max_features': [0.25, 0.5, 0.75, 1.0],
'min_samples_leaf': [3, 4, 5, 6],
'min_samples_split': [2, 4, 6, 8],
'n_estimators': [500, 1000, 22500,
5000]},
return_train_score=True, scoring='roc_auc', verbose=3)
rf_grid_training_time = time.time() - start_time
rf_cv_results=pd.DataFrame(rf_search.cv_results_)
rf_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score 0.993862
## std_train_score 0.000114
## mean_test_score 0.905631
## std_test_score 0.005357
## dtype: float64
The cross-validation mean train score is about 0.09 higher than the mean test score, which could indicate overfitting. Applying the classifier to the test data will provide more metrics to help us.
rf_clf=rf_search.best_estimator_
RandomForestClassifier(criterion='entropy', max_depth=14500, max_features=0.5,
min_samples_leaf=3, min_samples_split=6,
n_estimators=500, n_jobs=-1, random_state=1)
The above display presents the parameters chosen for the random forest classifier.
start_time = time.time()
rf_cv_accuracy=cross_val_score(rf_clf, X_train_tr, y_train_np,
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
rf_CrossAccur_time = time.time() - start_time
start_time = time.time()
rf_cv_recall=cross_val_score(rf_clf, X_train_tr, y_train_np,
scoring=recall_scorer,cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValRecall_time = time.time() - start_time
start_time = time.time()
rf_cv_precision=cross_val_score(rf_clf, X_train_tr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValPrec_time = time.time() - start_time
start_time = time.time()
rf_cv_f1=cross_val_score(rf_clf, X_train_tr, y_train_np,
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValF1_time = time.time() - start_time
start_time = time.time()
rf_cv_auc=cross_val_score(rf_clf, X_train_tr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
rf_CrossValAuc_time = time.time() - start_time
We’ll now use the classifier to predict on the test data.
y_pred_Base_rf=rf_base_clf.predict(X_test_tr)
print('*******Random Forest Classification Report********')
## *******Random Forest Classification Report********
print(classification_report(y_test_np, y_pred_Base_rf))
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 5137
## 1 0.91 0.73 0.81 1909
##
## accuracy 0.91 7046
## macro avg 0.91 0.85 0.87 7046
## weighted avg 0.91 0.91 0.90 7046
rfBase_recall_test=recall_score(y_test_np, y_pred_Base_rf, pos_label=1).round(2)
rfBase_roc_test=roc_auc_score(y_test_np, y_pred_Base_rf).round(2)
rfBase_precision_test=precision_score(y_test_np, y_pred_Base_rf, pos_label=1).round(2)
rfBase_test_accuracy=accuracy_score(y_test_np, y_pred_Base_rf).round(2)
rfBase_test_f1=f1_score(y_test_np, y_pred_Base_rf).round(2)
cm_rf_vl = metrics.confusion_matrix(y_test_np, y_pred_Base_rf, labels=[0,1])
df_cm_rf_vl = pd.DataFrame(cm_rf_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_rf_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_rf_vl.flatten()/np.sum(cm_rf_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,9))
sns.heatmap(df_cm_rf_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Random Forest", fontsize=14)
plt.show()
plt.clf()
The performance of our random forest classifier at accurately predicting fraud of yes is 19.66%, only slightly less than the SVC's 20%. Random forest did have a slightly lower percentage of inaccurately predicting fraud of yes (2.06%).
# let's create a dictionary of features and their importance values
feat_dict_rf= {}
for col, val in sorted(zip(X_train_tr.columns, rf_base_clf.feature_importances_),key=lambda x:x[1],reverse=True):
feat_dict_rf[col]=val
feat_rf_df = pd.DataFrame({'Feature':feat_dict_rf.keys(),'Importance':feat_dict_rf.values()})
feat_rf_tp5=feat_rf_df.nlargest(5,"Importance")
values = feat_rf_tp5.Importance
idx = feat_rf_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features Random Forest Model')
plt.ylabel("Features", fontsize=10)
plt.tick_params(axis='x', which='major', labelsize=9)
plt.tick_params(axis='y', labelsize=7,labelrotation=42)
plt.show()
plt.clf()
Of the top features for our random forest model, it’s interesting to note that two through five are also important features of our anomaly detection model.
Gradient boosting also uses incorrect predictions from previous trees to adjust the next tree, though this is accomplished by fitting each new tree to the errors of the previous tree's predictions. Mistakes from the previous trees are used to build a new tree solely around these mistakes. As mentioned earlier for AdaBoost, gradient boosting takes these errors (weak learners) and combines them into a strong learner. The difference is that the gradient boosting algorithm uses only the errors from the previous tree, in contrast to AdaBoost.
The main idea behind this algorithm is to build models sequentially and these subsequent models try to reduce the errors of the previous model. Errors are reduced by building a new model on the errors or residuals of the previous model.
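This sequential residual-fitting loop can be sketched on a toy regression target (illustrative only; the project itself uses scikit-learn's `GradientBoostingClassifier`, which handles the classification analogue internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy target to illustrate boosting: each new tree fits the previous errors
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()

learning_rate = 0.1
pred = np.zeros_like(y)                # start from a constant prediction
for _ in range(50):
    residuals = y - pred               # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # nudge predictions toward the residuals

# Training error shrinks as trees are added
print(round(float(np.mean((y - pred) ** 2)), 4))
```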
from sklearn.ensemble import GradientBoostingClassifier
gb=GradientBoostingClassifier(warm_start=True)
gb_base_clf=gb.fit(X_train_tr, y_train_np)
start_time = time.time()
gb_base_cv_accuracy=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
gbBase_CrossValAccur = time.time() - start_time
start_time = time.time()
gb_base_cv_recall=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=recall_scorer).mean().round(2)
gbBase_CrossValRecall= time.time() - start_time
start_time = time.time()
gb_base_cv_precision=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=precision_scorer).mean().round(2)
gbBase_CrossValPrec = time.time() - start_time
start_time = time.time()
gb_base_cv_f1=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
gbBase_CrossValF1 = time.time() - start_time
start_time = time.time()
gb_base_cv_auc=cross_val_score(gb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
gbBase_CrossValAuc = time.time() - start_time
gb_params={
'subsample':[0.4, 0.6, 0.7, 0.75],
'n_estimators':np.arange(500, 10000, 500),
'learning_rate':[0.0001, 0.001,.01,0.05, 0.075,0.1],
'max_features':range(6,20,2),
'min_samples_split':range(1000,2200,200),
'min_samples_leaf':range(30,70,10),
'max_depth':range(4,16,2),
}
search_cv_gb=RandomizedSearchCV(estimator=gb,param_distributions=gb_params,n_iter=6, scoring='roc_auc', cv=5, verbose=1, refit=True,return_train_score=True, n_jobs=-1, random_state=2)
start_time = time.time()
search_cv_gb.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(warm_start=True),
n_iter=6, n_jobs=-1,
param_distributions={'learning_rate': [0.0001, 0.001, 0.01,
0.05, 0.075, 0.1],
'max_depth': range(4, 16, 2),
'max_features': range(6, 20, 2),
'min_samples_leaf': range(30, 70, 10),
'min_samples_split': range(1000, 2200, 200),
'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500,
6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500]),
'subsample': [0.4, 0.6, 0.7, 0.75]},
random_state=2, return_train_score=True, scoring='roc_auc',
verbose=1)
gb_grid_training_time = time.time() - start_time
gb_cv_results=pd.DataFrame(search_cv_gb.cv_results_)
gb_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score 0.933659
## std_train_score 0.000696
## mean_test_score 0.881175
## std_test_score 0.006381
## dtype: float64
The cross-validation mean train score is about 0.05 higher than the mean test score, which could indicate overfitting. Applying the classifier to the test data will provide more metrics to help us.
gb_clf=search_cv_gb.best_estimator_
gb_clf
GradientBoostingClassifier(max_depth=12, max_features=10, min_samples_leaf=30,
min_samples_split=1000, n_estimators=3000,
subsample=0.7, warm_start=True)
The above display presents the parameters chosen for the gradient boosting model.
start_time = time.time()
gb_cv_f1_score=cross_val_score(gb_clf, X_train_tr, y_train_np,
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValF1_time = time.time() - start_time
start_time = time.time()
gb_cv_accuracy=cross_val_score(gb_clf, X_train_tr, y_train_np,
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValAccur_time = time.time() - start_time
start_time = time.time()
gb_cv_recall=cross_val_score(gb_clf, X_train_tr, y_train_np,
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValRecall_time = time.time() - start_time
start_time = time.time()
gb_cv_precision=cross_val_score(gb_clf, X_train_tr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValPrec_time = time.time() - start_time
start_time = time.time()
gb_cv_auc=cross_val_score(gb_clf, X_train_tr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
gb_CrossValAuc_time = time.time() - start_time
y_pred_gb=gb_clf.predict(X_test_tr)
cm_gb_vl = metrics.confusion_matrix(y_test_np, y_pred_gb, labels=[0,1])
df_cm_gb_vl = pd.DataFrame(cm_gb_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_gb_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_gb_vl.flatten()/np.sum(cm_gb_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,9))
sns.heatmap(df_cm_gb_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Gradient Boost", fontsize=14)
plt.show()
plt.clf()
The performance of our gradient boosting classifier at accurately predicting fraud of yes is 21.05%, just above the random forest's 19.66% and the SVC's 20%. However, gradient boosting's inaccurate predictions of fraud of yes stand at 3.18%, about 1% more than the other two classifiers.
# create a dictionary of features and their importance values
feat_dict_gb = {}
for col, val in sorted(zip(X_train_tr.columns,gb_clf.feature_importances_),key=lambda x:x[1],reverse=True):
feat_dict_gb[col]=val
feat_gb_df= pd.DataFrame({'Feature':feat_dict_gb.keys(),'Importance':feat_dict_gb.values()})
feat_gb_tp5=feat_gb_df.nlargest(5,"Importance")
values = feat_gb_tp5.Importance
idx = feat_gb_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features Gradient Boosting Model')
plt.ylabel("Features", fontsize=8)
plt.tick_params(axis='x', which='major', labelsize=8)
plt.tick_params(axis='y', labelsize=7, labelrotation=42)
plt.show()
plt.clf()
The top features of our gradient boosting model are the same as those from the random forest model.
Extreme gradient boosting is similar to gradient boosting with a few improvements. First, performance enhancements make it faster than other ensemble methods. Second, built-in regularization gives it an advantage in accuracy. Regularization is the process of adding information to reduce variance and prevent overfitting.
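To see where that regularization acts, XGBoost's objective gives each leaf a closed-form optimal weight w* = -G / (H + λ), where G and H are the sums of first and second loss derivatives over the leaf and λ is the L2 penalty (`reg_lambda`). A sketch with hypothetical gradient/hessian values shows how a larger λ shrinks leaf weights toward zero:

```python
import numpy as np

def leaf_weight(gradients, hessians, reg_lambda):
    """Optimal leaf weight in XGBoost's objective: w* = -G / (H + lambda).
    A larger reg_lambda shrinks the weight, damping variance."""
    G, H = np.sum(gradients), np.sum(hessians)
    return -G / (H + reg_lambda)

g = np.array([0.4, 0.3, 0.5])    # hypothetical gradient values for one leaf
h = np.array([0.2, 0.25, 0.2])   # hypothetical hessian values

print(leaf_weight(g, h, reg_lambda=0.0))  # unregularized weight
print(leaf_weight(g, h, reg_lambda=1.0))  # shrunk toward zero
```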
from xgboost import XGBClassifier
xgb=XGBClassifier(booster='gbtree',objective='binary:logistic', n_jobs=-1)
xgb_base_clf=xgb.fit(X_train_tr, y_train_np)
start_time = time.time()
xgb_base_cv_accuracy=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='accuracy').mean().round(2)
xgbBase_CrossValAccur = time.time() - start_time
start_time = time.time()
xgb_base_cv_recall=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=recall_scorer).mean().round(2)
xgbBase_CrossValRecall = time.time() - start_time
start_time = time.time()
xgb_base_cv_precision=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=precision_scorer).mean().round(2)
xgbBase_CrossValPrec = time.time() - start_time
start_time = time.time()
xgb_base_cv_f1=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring='f1').mean().round(2)
xgbBase_CrossValF1 = time.time() - start_time
start_time = time.time()
xgb_base_cv_auc=cross_val_score(xgb_base_clf, X_train_tr, y_train_np, cv=skf, scoring=roc_auc_scorer).mean().round(2)
xgbBase_CrossValAuc = time.time() - start_time
params_xg={
"learning_rate": [0.01, 0.05, 0.10, 0.20,0.25,0.4, 0.5],
"max_depth": range(2,10,2),
"min_child_weight": [1,3,5,7],
"gamma": [0.0,0.01,0.05,0.1,0.5,1,2,3],
"colsample_bytree": [0.5,0.6,0.7,0.8,0.9,1],
"colsample_bynode": [0.5,0.6,0.7,0.8,0.9,1],
"colsample_bylevel": [0.5,0.6,0.7,0.8,0.9,1],
"n_estimators":np.arange(500, 4000, 500),
'subsample': [0.5,0.6,0.7,0.8,0.9,1]
}
search_xg=RandomizedSearchCV(estimator=xgb,
param_distributions=params_xg,n_iter=6, scoring='roc_auc', cv=5, verbose=3, refit=True,return_train_score=True, n_jobs=-1)
start_time = time.time()
search_xg.fit(X_train_tr, y_train_np)
RandomizedSearchCV(cv=5,
estimator=XGBClassifier(base_score=None, booster='gbtree',
callbacks=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
early_stopping_rounds=None,
enable_categorical=False,
eval_metric=None, feature_types=None,
gamma=None, gpu_id=None,
grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_...
'colsample_bynode': [0.5, 0.6, 0.7, 0.8,
0.9, 1],
'colsample_bytree': [0.5, 0.6, 0.7, 0.8,
0.9, 1],
'gamma': [0.0, 0.01, 0.05, 0.1, 0.5, 1,
2, 3],
'learning_rate': [0.01, 0.05, 0.1, 0.2,
0.25, 0.4, 0.5],
'max_depth': range(2, 10, 2),
'min_child_weight': [1, 3, 5, 7],
'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500]),
'subsample': [0.5, 0.6, 0.7, 0.8, 0.9,
1]},
return_train_score=True, scoring='roc_auc', verbose=3)
xgb_grid_training_time = time.time() - start_time
xgb_cv_results=pd.DataFrame(search_xg.cv_results_)
xgb_cv_results[['mean_train_score', 'std_train_score','mean_test_score', 'std_test_score']].mean()
## mean_train_score 0.984471
## std_train_score 0.000194
## mean_test_score 0.902756
## std_test_score 0.006329
## dtype: float64
The cross-validation mean train score is about 0.08 higher than the mean test score, which could indicate overfitting. Applying the classifier to the test data will provide more metrics to help us.
xg_clf=search_xg.best_estimator_
XGBClassifier(base_score=None, booster='gbtree', callbacks=None,
colsample_bylevel=0.7, colsample_bynode=1, colsample_bytree=0.6,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, feature_types=None, gamma=0.01, gpu_id=None,
grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.01, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=8, max_leaves=None,
min_child_weight=1, missing=nan, monotone_constraints=None,
n_estimators=1500, n_jobs=-1, num_parallel_tree=None,
predictor=None, random_state=None, ...)
start_time = time.time()
xg_cv_accuracy=cross_val_score(xg_clf, X_train_tr, y_train_np,
scoring='accuracy', cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValAccur_time = time.time() - start_time
start_time = time.time()
xg_cv_recall=cross_val_score(xg_clf, X_train_tr, y_train_np,
scoring=recall_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValRecall_time = time.time() - start_time
start_time = time.time()
xg_cv_precision=cross_val_score(xg_clf, X_train_tr, y_train_np,
scoring=precision_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValPrec_time = time.time() - start_time
start_time = time.time()
xg_cv_auc=cross_val_score(xg_clf, X_train_tr, y_train_np,
scoring=roc_auc_scorer, cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValAuc_time = time.time() - start_time
start_time = time.time()
xg_cv_f1=cross_val_score(xg_clf, X_train_tr, y_train_np,
scoring='f1', cv=skf, n_jobs=-1).mean().round(2)
xgb_CrossValF1_time = time.time() - start_time
y_test_pred_base_xg=xgb_base_clf.predict(X_test_tr)
print('*******Extreme Gradient Boost Classification Report********')
## *******Extreme Gradient Boost Classification Report********
print(classification_report(y_test_np, y_test_pred_base_xg))
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 5137
## 1 0.89 0.74 0.81 1909
##
## accuracy 0.91 7046
## macro avg 0.90 0.86 0.88 7046
## weighted avg 0.91 0.91 0.90 7046
xg_recall_test_base=recall_score(y_test_np, y_test_pred_base_xg).round(2)
xg_roc_test_base=roc_auc_score(y_test_np, y_test_pred_base_xg).round(2)
xg_test_accuracy_base=accuracy_score(y_test_np, y_test_pred_base_xg).round(2)
xg_test_f1_base=f1_score(y_test_np, y_test_pred_base_xg).round(2)
cm_xg_vl = metrics.confusion_matrix(y_test_np, y_test_pred_base_xg, labels=[0,1])
df_cm_xg_vl = pd.DataFrame(cm_xg_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_xg_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_xg_vl.flatten()/np.sum(cm_xg_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,9))
sns.heatmap(df_cm_xg_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Extreme Gradient Boost", fontsize=14)
plt.show()
plt.clf()
The extreme gradient boosting classifier's accurate and inaccurate predictions of fraud yes are close to both the random forest and SVC classifiers, at 20.18% and 2.41%, respectively.
# let's create a dictionary of features and their importance values
feat_dict_xg= {}
for col, val in sorted(zip(X_train_tr.columns,xgb_base_clf.feature_importances_),key=lambda x:x[1],reverse=True):
feat_dict_xg[col]=val
feat_xg_df = pd.DataFrame({'Feature':feat_dict_xg.keys(),'Importance':feat_dict_xg.values()})
feat_xg_tp5=feat_xg_df.nlargest(5,"Importance")
values = feat_xg_tp5.Importance
idx = feat_xg_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features Extreme Gradient Boost Model')
plt.ylabel("Features", fontsize=9)
plt.tick_params(axis='x', which='major', labelsize=8)
plt.tick_params(axis='y', labelsize=7, labelrotation=42)
plt.show()
plt.clf()
Extreme gradient boosting's highest-scoring important feature, incident severity of Major Damage, is the same as gradient boosting's top feature. The difference is that its score of roughly 20% dominates, while the other important features hover around 2%.
metric_comparison=pd.DataFrame({'Model':['Logistic Regression','Support Vector Machine', 'Random Forest', 'Gradient Boosting', 'Extreme Gradient Boosting'],
'RecallBase':[lr_base_cv_recall_score, svc_base_cv_recall,rf_base_cv_recall, gb_base_cv_recall, xgb_base_cv_recall],
'RecallTune':[lr_cv_recall_score, svc_cv_recall_score,rf_cv_recall, gb_cv_recall, xg_cv_recall],
'PrecisionBase':[lr_base_cv_precision_score, svc_base_cv_precision,rf_base_cv_precision, gb_base_cv_precision, xgb_base_cv_precision],
'PrecisionTune':[lr_cv_precision_score, svc_cv_precision,rf_cv_precision, gb_cv_precision, xg_cv_precision],
'F1Base':[lr_base_cv_f1,svc_base_cv_f1, rf_base_cv_f1,gb_base_cv_f1, xgb_base_cv_f1],
'F1Tune':[lr_cv_f1_score, svc_cv_f1_score,rf_cv_f1, gb_cv_f1_score, xg_cv_f1],
'AUCBase':[lr_base_cv_auc_score, svc_base_cv_auc_score,rf_base_cv_auc, gb_base_cv_auc, xgb_base_cv_auc],
'AUCTune':[lr_cv_auc_score, svc_cv_auc, rf_cv_auc, gb_cv_auc,xg_cv_auc],
'GridTuneTime':[log_grid_training_time, svc_grid_training_time, rf_grid_training_time, gb_grid_training_time,xgb_grid_training_time],
'CVTime':[log_cross_val_Time, svc_cross_val_Time, rf_cross_val_Time, gb_cross_val_Time, xgb_cross_val_Time],
'CVBaseTime':[log_cross_valBase_Time,svcBase_cross_val_Time, rfBase_cross_val_Time,gbBase_cross_val_Time,xgbBase_cross_val_Time]})
metricTest_comparison=pd.DataFrame({'Model':['Support Vector Machine', 'Random Forest', 'Extreme Gradient Boosting'],
'RecallBase':[svc_base_cv_recall,rf_base_cv_recall, xgb_base_cv_recall],
'RecallTest':[svc_recallBase_test,rfBase_recall_test, xg_recall_test_base],
'PrecisionBase':[svc_base_cv_precision,rf_base_cv_precision, xgb_base_cv_precision],
'PrecisionTest':[svcBase_precision_test,rfBase_precision_test, xg_test_precision_base],
'F1Base':[svc_base_cv_f1, rf_base_cv_f1,xgb_base_cv_f1],
'F1Test':[svc_f1Base_test,rfBase_test_f1, xg_test_f1_base],
'AUCBase':[svc_base_cv_auc_score,rf_base_cv_auc, xgb_base_cv_auc],
'AUCTest':[svc_aucBase_test, rfBase_roc_test,xg_roc_test_base]
})
metric_comparison['CVTimeDiff']=metric_comparison['CVBaseTime']-metric_comparison['CVTime']
metric_comparison=metric_comparison.round({'RecallBase':2,'RecallTune':2,'PrecisionBase':2,
'PrecisionTune':2,'F1Base':2,'F1Tune':2,'AUCBase':2,'AUCTune':2,'GridTuneTime':2,'CVTime':2,'CVBaseTime':2,'CVTimeDiff':2})
library(dplyr)
library(gt)
library(gtExtras)
library(ggsci)
library(RColorBrewer)
library(ggplot2)
library(readr)
gt_comp_tbl <-
gt(metric_comparison) %>%
tab_header(
title = md("**Model Cross Validation Comparison**"),
subtitle = "Evaluation and Performance Metrics"
) %>%
tab_spanner(
label = "Metrics",
columns = c(RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune)
) %>%
tab_spanner(
label = "Time",
columns = c(GridTuneTime,CVTime, CVBaseTime,CVTimeDiff )
) %>%
tab_style(
style=cell_text(size=px(10)),
locations = cells_column_labels(c(Model,RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune,GridTuneTime,CVTime, CVBaseTime, CVTimeDiff)
)) %>%
tab_style(
style=cell_text(size=px(9.5)),
locations = cells_body(c(RecallBase, RecallTune , PrecisionBase,PrecisionTune,F1Base,F1Tune,AUCBase, AUCTune,GridTuneTime,CVTime, CVBaseTime, CVTimeDiff))
) %>%
tab_style(
style = cell_text(size=px(10)),
locations=cells_body(Model)
) %>%
data_color(
columns=c(PrecisionBase, PrecisionTune),
method="numeric",
palette="YlGn",
domain=c(0.91,0.5)) %>%
data_color(
columns=c(RecallBase, RecallTune, F1Base, F1Tune, AUCBase, AUCTune),
palette=c("#ffffff","#5A2D81"), domain=c(0.90,0.5)) %>%
data_color(
columns=c(GridTuneTime,CVTime,CVBaseTime, CVTimeDiff),
palette=c("#ffffff","#FFC72C"), domain=c(231,-59)) %>%
tab_options(table.background.color = "lightcyan") %>%
tab_source_note(source_note = md("**Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)**")) %>%
tab_style(
style=cell_text(size=px(9)),
locations=cells_source_notes()
)
**Model Cross Validation Comparison** — Evaluation and Performance Metrics

| Model | RecallBase | RecallTune | PrecisionBase | PrecisionTune | F1Base | F1Tune | AUCBase | AUCTune | GridTuneTime | CVTime | CVBaseTime | CVTimeDiff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.51 | 0.51 | 0.65 | 0.65 | 0.57 | 0.57 | 0.70 | 0.70 | 1.46 | 0.66 | 1.78 | 1.12 |
| Support Vector Machine | 0.73 | 0.78 | 0.89 | 0.85 | 0.80 | 0.81 | 0.85 | 0.86 | 171.50 | 92.37 | 33.86 | -58.51 |
| Random Forest | 0.72 | 0.73 | 0.90 | 0.90 | 0.80 | 0.80 | 0.84 | 0.85 | 230.96 | 21.77 | 2.66 | -19.11 |
| Gradient Boosting | 0.55 | 0.77 | 0.74 | 0.87 | 0.63 | 0.82 | 0.74 | 0.87 | 135.59 | 42.59 | 41.63 | -0.96 |
| Extreme Gradient Boosting | 0.75 | 0.76 | 0.89 | 0.91 | 0.81 | 0.83 | 0.86 | 0.86 | 98.62 | 47.93 | 10.76 | -37.17 |

Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes).
The table above presents evaluation metrics for class prediction of 1 (Yes), where the event of interest is a fraudulent submission. While all metrics are considered, Precision is our primary focus. A higher precision score indicates fewer false positives—cases where the model incorrectly flags a legitimate submission as fraudulent. Minimizing these errors is critical, as we do not want to falsely accuse a policyholder of fraud.
Recall, which reflects the rate of false negatives (predicting “No” when the actual class is “Yes”), is also important and will be the secondary metric of focus after precision.
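As a concrete illustration of how these metrics penalize different error types, here is a minimal sketch with made-up labels (not output from this project's models):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = fraud, 0 = legitimate claim
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Precision = TP / (TP + FP): of the 4 claims flagged as fraud, 3 truly were,
# so one policyholder would be falsely accused
print(precision_score(y_true, y_pred))  # 0.75

# Recall = TP / (TP + FN): of the 4 actual frauds, 3 were caught
print(recall_score(y_true, y_pred))  # 0.75

# F1 is the harmonic mean of precision and recall
print(f1_score(y_true, y_pred))  # 0.75
```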
From the table, we observe that the tuned Extreme Gradient Boosting model achieves the highest precision score (0.91), with Random Forest (0.90) and the Support Vector Machine (SVM) close behind. Interestingly, the base models are competitive with their tuned counterparts: the base SVM precision is actually higher than the tuned version (0.89 vs. 0.85), Random Forest is identical (0.90), and the base Extreme Gradient Boosting model is only 0.02 lower than its tuned version (0.89 vs. 0.91).
This presents two key advantages of using base models:
- No hyperparameter tuning is required, which eliminates the grid search time of roughly 99 to 231 seconds for these three models.
- Faster cross-validation, with the base models saving roughly 19 to 59 seconds per run.
Our next step is to compare these cross-validation results with those from the test set (unseen data). We will proceed with the three models that showed the best precision and training efficiency: SVM, Random Forest, and Extreme Gradient Boost.
gt_Testcomp_tbl <-
gt(metricTests_comparison) %>%
tab_header(
title = md("**Model Test/Cross Validation Comparison**"),
subtitle = "Evaluation Metrics"
) %>%
tab_spanner(
label = "Metrics",
columns = c(RecallBase, RecallTest , PrecisionBase,PrecisionTest,F1Base,F1Test,AUCBase, AUCTest)
) %>%
tab_style(
style=cell_text(size=px(10)),
locations = cells_column_labels(c(Model,RecallBase, RecallTest,PrecisionBase,PrecisionTest,F1Base,F1Test,AUCBase, AUCTest)
)) %>%
tab_style(
style=cell_text(size=px(10)),
locations=cells_body(c(RecallBase, RecallTest, PrecisionBase,PrecisionTest,F1Base,F1Test, AUCBase, AUCTest))
) %>%
tab_style(
style = cell_text(size=px(11)),
locations=cells_body(Model)
) %>%
data_color(
columns=c(PrecisionBase,PrecisionTest),
method="numeric",
palette="Blues",
domain=c(0.91,0.88)) %>%
data_color(
columns=c(RecallBase, F1Base, AUCBase),
palette=c("#ffffff","#5A2D81"), domain=c(0.86,0.71)) %>%
data_color(
columns=c( RecallTest, F1Test, AUCTest),
palette=c("#ffffff","#FFC72C"), domain=c(0.86,0.71)) %>%
tab_options(table.background.color = "lightcyan") %>%
tab_source_note(source_note = md("**Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes)**")) %>%
tab_style(
style=cell_text(size=px(9)),
locations=cells_source_notes()
) %>%
tab_style(
style = cell_text(size=px(9)),
locations = cells_column_spanners()
)
**Model Test/Cross Validation Comparison**
*Evaluation Metrics*

| Model | RecallBase | RecallTest | PrecisionBase | PrecisionTest | F1Base | F1Test | AUCBase | AUCTest | Recall_diff | Precision_diff | F1_diff | AUC_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Support Vector Machine | 0.73 | 0.74 | 0.89 | 0.90 | 0.80 | 0.81 | 0.85 | 0.85 | -0.01 | -0.01 | -0.01 | 0.00 |
| Random Forest | 0.72 | 0.73 | 0.90 | 0.91 | 0.80 | 0.81 | 0.84 | 0.85 | -0.01 | -0.01 | -0.01 | -0.01 |
| Extreme Gradient Boosting | 0.75 | 0.74 | 0.89 | 0.89 | 0.81 | 0.81 | 0.86 | 0.86 | 0.01 | 0.00 | 0.00 | 0.00 |

**Precision, Recall, and F1 scores are displayed for prediction of 1 (Fraud=Yes).** The last four columns show the difference between cross-validation and test scores.
The table above compares cross-validation scores with test
set performance to evaluate how well our selected models generalize to
unseen data. Specifically, we are looking for signs of overfitting or
underfitting:
- Overfitting occurs when a model performs well on training data but poorly on new data, often indicated by training scores significantly higher than test scores.
- Underfitting happens when a model does not learn the training data well, which can be suggested when test scores are higher than training scores.
Looking at the metrics, we see that both the Support Vector Machine and Random Forest models show a very slight underfitting pattern for Precision, with test scores 0.01 higher than their cross-validation scores.
Examining other metrics:
- Recall and AUC scores are very consistent across the cross-validation and test sets.
- Both the SVM and Random Forest models show only a 0.01 difference in Recall and AUC, suggesting strong generalization.
While both models perform similarly and generalize well, we note that Random Forest has a slightly higher Precision Test score (0.91 vs. 0.90) and matches SVM in F1 and AUC.
Given the strong overall performance and balance across metrics, Random Forest is selected as the final model for fitting to new data. However, SVM remains a strong alternative and may still be considered in further evaluations.
Now that we have chosen a final model, we will use it to predict on new data. We will go through the same steps of cleaning, feature engineering, and preparation as with the original data.
The only difference between the new data and the data used for training and testing is that there are no labels, meaning the observations have not been labeled as fraud or no fraud.
We will validate that the columns of our newly imported data match the columns of our originally imported data.
def df_columns_equal(df_original, df_new):
    assert df_original.columns.equals(df_new.columns), "Mismatch between original and new data frame columns"
    print("Original and new data frame columns match")
df_columns_equal(Train_Demographics_p, new_Demographics)
## Original and new data frame columns match
df_columns_equal(Train_Claim_p, new_Claim)
## Original and new data frame columns match
df_columns_equal(Train_Policy_p, new_Policy)
## Original and new data frame columns match
We’ve confirmed that the columns of our new data sets are equal to the columns of our original data sets. We can now proceed to merge our new data sets.
new_fraud=new_Claim.merge(new_Demographics, on="CustomerID")\
.merge(new_Policy, on="CustomerID")
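The chained merge above assumes CustomerID uniquely identifies rows in each table. As a hedged sketch (toy frames with hypothetical column names), pandas can enforce that assumption with the `validate` argument, which raises a MergeError if the key is duplicated:

```python
import pandas as pd

claims = pd.DataFrame({"CustomerID": [1, 2, 3], "Claim": [100, 200, 300]})
demographics = pd.DataFrame({"CustomerID": [1, 2, 3], "Age": [34, 51, 28]})

# validate="one_to_one" raises pandas.errors.MergeError if CustomerID
# is duplicated on either side, catching silent row multiplication
merged = claims.merge(demographics, on="CustomerID", validate="one_to_one")
print(merged.shape)  # (3, 3)
```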
The following function checks that the merged data is a pandas DataFrame.
def check_is_dataframe(df):
    assert isinstance(df, pd.DataFrame), "Error: data is not a DataFrame."
    print("Data is a DataFrame")
check_is_dataframe(new_fraud)
## Data is a DataFrame
## Shape new_fraud: (8912, 37)
Our new data frame has 8,912 rows and 37 columns.
## new_fraud data types
## CustomerID object
## DateOfIncident object
## TypeOfIncident object
## TypeOfCollission object
## SeverityOfIncident object
## AuthoritiesContacted object
## IncidentState object
## IncidentCity object
## IncidentAddress object
## IncidentTime int32
## NumberOfVehicles int32
## PropertyDamage object
## BodilyInjuries int32
## Witnesses object
## PoliceReport object
## AmountOfTotalClaim object
## AmountOfInjuryClaim int32
## AmountOfPropertyClaim int32
## AmountOfVehicleDamage int32
## InsuredAge float64
## InsuredZipCode float64
## InsuredGender object
## InsuredEducationLevel object
## InsuredOccupation object
## InsuredHobbies object
## CapitalGains float64
## CapitalLoss float64
## Country object
## InsurancePolicyNumber float64
## CustomerLoyaltyPeriod float64
## DateOfPolicyCoverage object
## InsurancePolicyState object
## Policy_CombinedSingleLimit object
## Policy_Deductible float64
## PolicyAnnualPremium float64
## UmbrellaLimit float64
## InsuredRelationship object
## dtype: object
new_fraud_v2=new_fraud.copy()
Reviewing the data types of the new data, we notice that certain columns that are numeric in our original train/test data have an object data type. We'll use a function to convert these columns to a numeric data type.
def convert_object_to_int(df, columns):
    """
    Converts specified object columns to nullable Int32 in place.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the object columns.
        columns (list): List of column names to convert.
    """
    for col in columns:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int32')
convert_object_to_int(new_fraud_v2,['AmountOfTotalClaim'])
## new_fraud_v2 data types
## CustomerID object
## DateOfIncident object
## TypeOfIncident object
## TypeOfCollission object
## SeverityOfIncident object
## AuthoritiesContacted object
## IncidentState object
## IncidentCity object
## IncidentAddress object
## IncidentTime int32
## NumberOfVehicles int32
## PropertyDamage object
## BodilyInjuries int32
## Witnesses object
## PoliceReport object
## AmountOfTotalClaim Int32
## AmountOfInjuryClaim int32
## AmountOfPropertyClaim int32
## AmountOfVehicleDamage int32
## InsuredAge float64
## InsuredZipCode float64
## InsuredGender object
## InsuredEducationLevel object
## InsuredOccupation object
## InsuredHobbies object
## CapitalGains float64
## CapitalLoss float64
## Country object
## InsurancePolicyNumber float64
## CustomerLoyaltyPeriod float64
## DateOfPolicyCoverage object
## InsurancePolicyState object
## Policy_CombinedSingleLimit object
## Policy_Deductible float64
## PolicyAnnualPremium float64
## UmbrellaLimit float64
## InsuredRelationship object
## dtype: object
We will use our previously created function to transform the date columns to the correct datetime data type.
convert_to_datetime(new_fraud_v2,'DateOfIncident')
convert_to_datetime(new_fraud_v2,'DateOfPolicyCoverage')
check_is_datetime(new_fraud_v2, 'DateOfIncident')
## Feature 'DateOfIncident' is datetime dtype
check_is_datetime(new_fraud_v2, 'DateOfPolicyCoverage')
## Feature 'DateOfPolicyCoverage' is datetime dtype
We have successfully transformed the date columns to the appropriate datetime data type.
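The convert_to_datetime and check_is_datetime helpers were defined earlier in the project and are not repeated here; as a rough sketch under that assumption, the conversion likely resembles:

```python
import pandas as pd

def convert_to_datetime(df, column):
    # Parse the column to pandas datetime in place; unparseable values become NaT
    df[column] = pd.to_datetime(df[column], errors='coerce')

df = pd.DataFrame({"DateOfIncident": ["2015-01-20", "2015-02-03"]})
convert_to_datetime(df, "DateOfIncident")
print(df["DateOfIncident"].dtype)  # datetime64[ns]
```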
We’ll now create new features from the date features.
new_fraud_v2["coverageIncidentDiff"]=(new_fraud_v2["DateOfIncident"]-new_fraud_v2["DateOfPolicyCoverage"])
new_fraud_v2["coverageIncidentDiff"]=new_fraud_v2["coverageIncidentDiff"]/np.timedelta64(1,'Y')
## count 8912.000000
## mean 13.130826
## std 6.591779
## min -0.032855
## 25% 7.610697
## 50% 13.298014
## 75% 18.804630
## max 25.065539
## Name: coverageIncidentDiff, dtype: float64
coverageIncidentDiff ranges from a minimum of -0.03 to a maximum of 25.07 years. The slightly negative minimum suggests at least one incident was recorded just before its policy coverage date.
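Note that dividing by np.timedelta64(1, 'Y') relies on an average year length and is deprecated in some NumPy/pandas versions; an equivalent conversion via day counts (shown with made-up dates) gives essentially the same result:

```python
import pandas as pd

dates = pd.DataFrame({
    "DateOfPolicyCoverage": pd.to_datetime(["2000-01-01", "2010-06-15"]),
    "DateOfIncident": pd.to_datetime(["2015-01-01", "2012-06-15"]),
})

# Convert the timedelta to years by dividing day counts by the average year length
diff = dates["DateOfIncident"] - dates["DateOfPolicyCoverage"]
years = diff.dt.days / 365.25
print(years.round(2).tolist())  # [15.0, 2.0]
```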
new_fraud_v2['dayOfWeek'] = new_fraud_v2["DateOfIncident"].dt.day_name()
new_fraud_v2['dayOfWeek'].value_counts(normalize=True).round(2)
## Saturday 0.15
## Wednesday 0.15
## Tuesday 0.15
## Friday 0.14
## Thursday 0.14
## Monday 0.14
## Sunday 0.13
## Name: dayOfWeek, dtype: float64
## ******** Unique Number of Vehicles********
## [3 1 2 4]
## ******** Unique Bodily Injuries********
## [0 1 2]
## NumberOfVehicles and BodilyInjuries data type:
## NumberOfVehicles int32
## BodilyInjuries int32
## dtype: object
Both BodilyInjuries and NumberOfVehicles have a small number of unique values, yet their data types are integer. They are best treated as categorical, so we'll apply our previously created function to transform them.
convert_to_cat(new_fraud_v2, 'NumberOfVehicles')
convert_to_cat(new_fraud_v2, 'BodilyInjuries')
check_is_categorical(new_fraud_v2,'NumberOfVehicles')
## Feature 'NumberOfVehicles' is categorical dtype
check_is_categorical(new_fraud_v2,'BodilyInjuries')
## Feature 'BodilyInjuries' is categorical dtype
## *************Incident Time Unique Values*************
## [ 4 16 20 10 7 22 6 14 15 19 12 17 18 5 13 11 23 8 21 9 3 2 1 0
## -5]
time_day={
    5:'early morning', 6:'early morning', 7:'early morning', 8:'early morning',
    9:'late morning', 10:'late morning', 11:'late morning',
    12:'early afternoon', 13:'early afternoon', 14:'early afternoon', 15:'early afternoon',
    16:'late afternoon', 17:'late afternoon',
    18:'evening', 19:'evening',
    20:'night', 21:'night', 22:'night', 23:'night', 24:'night',
    1:'night', 2:'night', 3:'night', 4:'night'
}
new_fraud_v2['IncidentPeriodDay']=new_fraud_v2['IncidentTime'].map(time_day)
## ***Incident Period Day Value Counts***
## night 2333
## early morning 1765
## early afternoon 1732
## late morning 1149
## late afternoon 1018
## evening 808
## Name: IncidentPeriodDay, dtype: int64
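One caveat with a dictionary lookup like time_day: hours absent from the dictionary map to NaN rather than raising an error. A small sketch (sample hours made up, mapping abbreviated) shows the behavior; this plausibly explains the missing IncidentPeriodDay values handled later, since the unique IncidentTime values above include 0 and -5, neither of which is a key in time_day:

```python
import pandas as pd

hours = pd.Series([4, 16, 20, 10, 0, -5])

# Abbreviated version of the time_day mapping above
time_day_small = {4: 'night', 16: 'late afternoon', 20: 'night', 10: 'late morning'}

period = hours.map(time_day_small)
# The unmapped hours 0 and -5 silently become NaN
print(period.isna().sum())  # 2
```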
## Data frame includes datatypes object True
new_fraud_v3=new_fraud_v2.copy()
new_fraud_v3=new_fraud_v3.drop(['DateOfIncident', 'DateOfPolicyCoverage', 'IncidentTime'], axis=1)
As with our original data set, we will convert object data types to categorical.
convert_cats(new_fraud_v3)
check_no_object_dtype(new_fraud_v3)
## ✅ No object dtype columns found in the DataFrame.
gs=plt.GridSpec(1, 3)
fig=plt.figure(figsize=(10,8))
fig.suptitle('Categorical Counts-1', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
#plt.title('Type of Incident',fontsize=7, y=1)
hg=sns.countplot(data = new_fraud_v3, x = 'TypeOfIncident', ax=ax1)
hg.tick_params(axis='both', which='major', labelsize=4)
hg.set_xlabel("Type of Incident", fontsize=5)
hg.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
sp=sns.countplot(data=new_fraud_v3, x='TypeOfCollission', ax=ax2)
sp.tick_params(axis='both', which='major', labelsize=5)
sp.set_xlabel("Type of Collision", fontsize=5)
sp.set_ylabel("Count",fontsize=4)
#plt.title('Reported Fraud',fontsize=7, y=1)
bp=sns.countplot(data=new_fraud_v3, x='SeverityOfIncident', ax=ax3)
bp.tick_params(axis='both', which='major', labelsize=5)
bp.set_xlabel("SeverityOfIncident", fontsize=5)
bp.set_ylabel("Count", fontsize=5)
plt.tight_layout()
plt.show()
plt.clf()
my_tab=pd.crosstab(index=new_fraud_v3["TypeOfIncident"], columns=new_fraud_v3["TypeOfCollission"], normalize=True).round(2)
fig = plt.figure(figsize=(13, 10))
sns.heatmap(my_tab, cmap="BuGn",cbar=False, annot=True,linewidth=0.3)
plt.yticks(rotation=0)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0, 0.5, 'Multi-vehicle Collision'), Text(0, 1.5, 'Parked Car'), Text(0, 2.5, 'Single Vehicle Collision'), Text(0, 3.5, 'Vehicle Theft')])
plt.xticks(rotation=60)
## (array([0.5, 1.5, 2.5, 3.5]), [Text(0.5, 0, '?'), Text(1.5, 0, 'Front Collision'), Text(2.5, 0, 'Rear Collision'), Text(3.5, 0, 'Side Collision')])
plt.title('Type of Incident vs Type of Collision', fontsize=20)
plt.xlabel('TypeOfCollision', fontsize=15)
plt.ylabel('TypeOfIncident', fontsize=15)
plt.show()
plt.clf()
new_fraud_v4=new_fraud_v3.copy()
We observe from the crosstab that the unknown type of collision ('?') is associated with only a small number of incident types. These data points will be retained by recoding the '?' level as 'None'.
new_fraud_v4['TypeOfCollission'] = new_fraud_v4['TypeOfCollission'].replace(['?'], 'None')
plt.figure(figsize=(16,10))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=new_fraud_v4, x='TypeOfCollission')
#plt.tick_params(label_rotation=45)
ax.tick_params(axis='both', which='major', labelsize=11)
ax.set_title("Type of Collision-Changed", size=22)
ax.set(xlabel=None)
ax.set(ylabel=None)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
gs=plt.GridSpec(2, 3)
fig=plt.figure(figsize=(11,6))
fig.suptitle('Categorical Counts-2', fontsize=8)
ax1=fig.add_subplot(gs[0, 0])
ax2=fig.add_subplot(gs[0, 1])
ax3=fig.add_subplot(gs[0,2])
ax4=fig.add_subplot(gs[1, 0])
ax5=fig.add_subplot(gs[1, 1])
ax6=fig.add_subplot(gs[1,2])
#plt.title('Type of Incident',fontsize=7, y=1)
c1=sns.countplot(data = new_fraud_v4, x = 'Witnesses', ax=ax1)
c1.tick_params(axis='both', which='major', labelsize=4)
c1.set_xlabel('Witnesses', fontsize=5)
c1.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
c2=sns.countplot(data=new_fraud_v4, x='BodilyInjuries', ax=ax2)
c2.tick_params(axis='both', which='major', labelsize=5)
c2.set_xlabel("Bodily Injuries", fontsize=5)
c2.set_ylabel("Count",fontsize=4)
#plt.title('Reported Fraud',fontsize=7, y=1)
c3=sns.countplot(data=new_fraud_v4, x='PropertyDamage', ax=ax3)
c3.tick_params(axis='both', which='major', labelsize=5)
c3.set_xlabel("Property Damage", fontsize=5)
c3.set_ylabel("Count", fontsize=5)
c4=sns.countplot(data = new_fraud_v4, x = 'NumberOfVehicles', ax=ax4)
c4.tick_params(axis='both', which='major', labelsize=4)
c4.set_xlabel("Number Of Vehicles", fontsize=5)
c4.set_ylabel("Count",fontsize=5)
#plt.title('Type of Collision',fontsize=7, y=1)
c5=sns.countplot(data=new_fraud_v4, x='IncidentState', ax=ax5)
c5.tick_params(axis='both', which='major', labelsize=5)
c5.set_xlabel("Incident State", fontsize=5)
c5.set_ylabel("Count",fontsize=4)
#plt.title('Reported Fraud',fontsize=7, y=1)
c6=sns.countplot(data=new_fraud_v4, x='AuthoritiesContacted', ax=ax6)
c6.tick_params(axis='both', which='major', labelsize=5)
c6.set_xlabel("Authorities Contacted", fontsize=5)
c6.set_ylabel("Count", fontsize=5)
plt.tight_layout()
plt.show()
plt.clf()
new_fraud_v5=new_fraud_v4.copy()
new_fraud_v5['Witnesses']=new_fraud_v5['Witnesses'].cat.remove_categories("MISSINGVALUE")
new_fraud_v5=new_fraud_v5.drop(['PropertyDamage'], axis=1)
plt.figure(figsize=(14,8))
#plt.title("Type of Collision-Changed")
ax=sns.countplot(data=new_fraud_v5, x='Witnesses')
#plt.tick_params(label_rotation=45)
ax.set_title("Witnesses-Changed", size=20)
ax.set(xlabel=None)
ax.set(ylabel=None)
ax.tick_params(axis='both', which='major', labelsize=14)
sns.set_style("dark")
ax.annotate('Figure ##',
xy = (1.0, -0.2),
xycoords='axes fraction',
ha='right',
va="center",
fontsize=10)
fig.tight_layout()
plt.show()
plt.clf()
new_fraud_v5.isna().sum() > 0
## CustomerID False
## TypeOfIncident False
## TypeOfCollission False
## SeverityOfIncident False
## AuthoritiesContacted False
## IncidentState False
## IncidentCity False
## IncidentAddress False
## NumberOfVehicles False
## BodilyInjuries False
## Witnesses True
## PoliceReport False
## AmountOfTotalClaim True
## AmountOfInjuryClaim False
## AmountOfPropertyClaim False
## AmountOfVehicleDamage False
## InsuredAge False
## InsuredZipCode False
## InsuredGender True
## InsuredEducationLevel False
## InsuredOccupation False
## InsuredHobbies False
## CapitalGains False
## CapitalLoss False
## Country True
## InsurancePolicyNumber False
## CustomerLoyaltyPeriod False
## InsurancePolicyState False
## Policy_CombinedSingleLimit False
## Policy_Deductible False
## PolicyAnnualPremium False
## UmbrellaLimit False
## InsuredRelationship False
## coverageIncidentDiff False
## dayOfWeek False
## IncidentPeriodDay True
## dtype: bool
From the output above we find that several features contain null (missing) values. Before we remove any missing values, we'll drop features that will not be used in our models.
new_fraud_v6=new_fraud_v5.copy()
new_fraud_v6=new_fraud_v6.drop(['CustomerID', 'IncidentAddress', 'InsuredZipCode', 'InsuredHobbies','Country', 'InsurancePolicyNumber', 'IncidentCity','AmountOfInjuryClaim', 'AmountOfPropertyClaim', 'AmountOfVehicleDamage',
'InsuredEducationLevel','InsuredOccupation','PoliceReport'], axis=1)
new_fraud_v6.isna().sum()
## TypeOfIncident 0
## TypeOfCollission 0
## SeverityOfIncident 0
## AuthoritiesContacted 0
## IncidentState 0
## NumberOfVehicles 0
## BodilyInjuries 0
## Witnesses 12
## AmountOfTotalClaim 8
## InsuredAge 0
## InsuredGender 8
## CapitalGains 0
## CapitalLoss 0
## CustomerLoyaltyPeriod 0
## InsurancePolicyState 0
## Policy_CombinedSingleLimit 0
## Policy_Deductible 0
## PolicyAnnualPremium 0
## UmbrellaLimit 0
## InsuredRelationship 0
## coverageIncidentDiff 0
## dayOfWeek 0
## IncidentPeriodDay 107
## dtype: int64
new_fraud_v7=new_fraud_v6.copy()
new_fraud_v7=new_fraud_v7.dropna()
new_fraud_v7.isna().sum()
## TypeOfIncident 0
## TypeOfCollission 0
## SeverityOfIncident 0
## AuthoritiesContacted 0
## IncidentState 0
## NumberOfVehicles 0
## BodilyInjuries 0
## Witnesses 0
## AmountOfTotalClaim 0
## InsuredAge 0
## InsuredGender 0
## CapitalGains 0
## CapitalLoss 0
## CustomerLoyaltyPeriod 0
## InsurancePolicyState 0
## Policy_CombinedSingleLimit 0
## Policy_Deductible 0
## PolicyAnnualPremium 0
## UmbrellaLimit 0
## InsuredRelationship 0
## coverageIncidentDiff 0
## dayOfWeek 0
## IncidentPeriodDay 0
## dtype: int64
All null values have been removed from our data.
new_fraud_v7[new_fraud_v7['PolicyAnnualPremium']==-1].shape
## (47, 23)
A value of -1 in PolicyAnnualPremium represents a missing value.
new_fraud_v8=new_fraud_v7.copy()
new_fraud_v8=new_fraud_v8[new_fraud_v8['PolicyAnnualPremium']!=-1]
new_fraud_v8[new_fraud_v8['PolicyAnnualPremium']==-1].shape
## (0, 23)
The PolicyAnnualPremium feature now has no missing values (sentinel value -1).
The first step is to confirm that our new data new_fraud_v8 has the same columns as our training data. We'll then check whether the levels of the categorical columns in both data frames are equal.
new_data=new_fraud_v8.copy()
print("X_train and new_data columns are equal:",X_train.columns.equals(new_data.columns))
## X_train and new_data columns are equal: True
categorical, numerical=define_columns(new_data)
## Categorical Features: ['TypeOfIncident', 'TypeOfCollission', 'SeverityOfIncident', 'AuthoritiesContacted', 'IncidentState', 'NumberOfVehicles', 'BodilyInjuries', 'Witnesses', 'InsuredGender', 'InsurancePolicyState', 'Policy_CombinedSingleLimit', 'InsuredRelationship', 'dayOfWeek', 'IncidentPeriodDay']
## numerical Features: ['AmountOfTotalClaim', 'InsuredAge', 'CapitalGains', 'CapitalLoss', 'CustomerLoyaltyPeriod', 'Policy_Deductible', 'PolicyAnnualPremium', 'UmbrellaLimit', 'coverageIncidentDiff']
Let’s check that the levels from the new data new_fraud_v8 are the same as those categorical levels from the X_train data.
def assert_categorical_levels_match(df1, df2, categorical_columns):
    """
    Checks if the unique values (levels) of categorical columns in two DataFrames match.

    Parameters:
        df1 (pd.DataFrame): First DataFrame
        df2 (pd.DataFrame): Second DataFrame
        categorical_columns (list): List of categorical column names to compare

    Raises:
        AssertionError: If any categorical column has mismatched levels between the two DataFrames.
    """
    for col in categorical_columns:
        levels_df1 = set(df1[col].unique())
        levels_df2 = set(df2[col].unique())
        assert levels_df1 == levels_df2, f"Mismatch in column '{col}': {levels_df1 ^ levels_df2}"
    print("All categorical column levels match.")
assert_categorical_levels_match(X_train, new_data,categorical)
## All categorical column levels match.
Now that we have confirmed the categorical levels of both data frames match, we can transform them in preparation for fitting our model.
X_train_tr, X_new=transform_x_columns_tr(X_train, new_data)
## Rows of First three Columns
## num__AmountOfTotalClaim num__InsuredAge num__CapitalGains
## 0 0.639019 -1.489358 1.199641
## 1 0.121226 0.141468 1.210473
## 2 0.289220 0.016020 0.260888
## 3 -1.870518 -0.109428 1.636523
## 4 -0.694800 -1.238462 0.430586
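transform_x_columns_tr was defined earlier in the project; the essential pattern it implements, fitting the scaler and encoder on the training data only and reusing that fit on the new data, can be sketched as follows (toy frames, hypothetical column names):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

train = pd.DataFrame({"amount": [100.0, 250.0, 400.0], "state": ["A", "B", "A"]})
new = pd.DataFrame({"amount": [250.0], "state": ["B"]})

ct = make_column_transformer(
    (StandardScaler(), ["amount"]),
    (OneHotEncoder(handle_unknown="ignore"), ["state"]),
)
train_tr = ct.fit_transform(train)  # scaling statistics come from train only
new_tr = ct.transform(new)          # new data reuses the same fitted statistics
print(new_tr.shape)  # (1, 3)
```

The key design point is that `transform` is called on the new data without refitting, so the new observations are scaled and encoded exactly as the training data were.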
## Test Model
import unittest
import time
from io import StringIO
model=xgb_base_clf
X_test=X_new.copy()
Our next step is to test the XGBoost classifier on our new data to ensure that it returns the expected class labels [0, 1] and probabilities between 0 and 1.
test_results = []
# Define a test class with CSV logging
class TestModelInference(unittest.TestCase):
    def setUp(self):
        self.model = model
        self.X_test = X_test

    def test_prediction_output_values(self):
        """Test that model predictions contain only valid class labels."""
        start_time = time.time()
        pred = self.model.predict(self.X_test)
        unique_values = np.unique(pred)
        for value in unique_values:
            self.assertIn(value, [0, 1])
        elapsed_time = time.time() - start_time
        test_results.append(["Prediction Output Values", "Pass", round(elapsed_time, 4)])

    def test_prediction_probabilities(self):
        """Test that the model returns valid probability values between 0 and 1."""
        start_time = time.time()
        prob_pred = self.model.predict_proba(self.X_test)
        self.assertTrue(np.all((prob_pred >= 0) & (prob_pred <= 1)), "Probabilities must be between 0 and 1")
        self.assertTrue(np.allclose(prob_pred.sum(axis=1), 1, atol=1e-6), "Sum of probabilities must be close to 1")
        elapsed_time = time.time() - start_time
        test_results.append(["Prediction Probabilities", "Pass", round(elapsed_time, 4)])

    def test_prediction_time(self):
        """Test that the model predicts within an acceptable time limit."""
        start_time = time.time()
        _ = self.model.predict(self.X_test)
        elapsed_time = time.time() - start_time
        self.assertLess(elapsed_time, 1, f"Prediction took too long: {elapsed_time:.4f} seconds")
        test_results.append(["Prediction Time", "Pass" if elapsed_time < 1 else "Fail", round(elapsed_time, 4)])

# Run tests and capture results
if __name__ == "__main__":
    print("\n===== Running Model Tests =====\n")
    # Redirect unittest output to a buffer
    test_buffer = StringIO()
    runner = unittest.TextTestRunner(stream=test_buffer, verbosity=2)
    # The first argv entry is the program name and is ignored by unittest
    unittest.main(argv=['first-arg-is-ignored'], exit=False, testRunner=runner)
    # Convert test results to a DataFrame
    df_results = pd.DataFrame(test_results, columns=["Test Name", "Status", "Execution Time (s)"])
    # Save results to CSV
    df_results.to_csv("test_results.csv", index=False)
    print("\nTest results saved to test_results.csv")
##
## ===== Running Model Tests =====
##
## <unittest.main.TestProgram object at 0x3607c8130>
##
## Test results saved to test_results.csv
| Test Name | Status | Execution Time (s) |
|---|---|---|
| Prediction Output Values | Pass | 0.0041 |
| Prediction Probabilities | Pass | 0.0030 |
| Prediction Time | Pass | 0.0037 |
Our classifier passed the tests. Now we will use the classifier to predict on the new data.
xg_predictions=xgb_base_clf.predict(X_new)
xg_results_df = new_data.copy()
xg_results_df["Predicted_Label"] = xg_predictions
Let’s compare the predicted labels (fraud) on the new data to the reported fraud in our original data.
## Predicted Fraud Percentages from New Data:
## 0 0.85
## 1 0.15
## Name: Predicted_Label, dtype: float64
## Reported Fraud Percentages from Original Data:
## N 0.73
## Y 0.27
## Name: ReportedFraud, dtype: float64
The results show that our model predicted 15% of the observations as fraud, compared to 27% reported as fraud in the original data, a difference of 12 percentage points. Let’s check the prediction results on our original test data.
xg_test_predictions=xgb_base_clf.predict(X_test_tr)
xg_test_results_df = X_test_tr.copy()
xg_test_results_df["Predicted_Label"] = xg_test_predictions
test_count=xg_test_results_df["Predicted_Label"].value_counts(normalize=True).round(2)
## Reported Fraud Percentages from Original Data:
## 0 0.77
## 1 0.23
## Name: Predicted_Label, dtype: float64
The predicted labels on the test data are closer to the original reported fraud rate than those from the new data.
Let’s take a look at the predicted probabilities for both the new and test data.
xg_probs_new=xgb_base_clf.predict_proba(X_new)
xg_probs_df = pd.DataFrame(xg_probs_new, columns=['fraud_no', 'fraud_yes'])
## First Five Rows of xg_probs_df
## fraud_no fraud_yes
## 0 0.990794 0.009206
## 1 0.877024 0.122976
## 2 0.873884 0.126116
## 3 0.909902 0.090098
## 4 0.926205 0.073795
xg_probs_test=xgb_base_clf.predict_proba(X_test_tr)
xg_test_probs_df = pd.DataFrame(xg_probs_test, columns=['fraud_no', 'fraud_yes'])
## First Five Rows of xg_test_probs_df
## fraud_no fraud_yes
## 0 0.149526 0.850474
## 1 0.932576 0.067424
## 2 0.027026 0.972974
## 3 0.922639 0.077361
## 4 0.956429 0.043571
xg_probs_df["fraud_yes"].hist()
plt.title("Distribution Predicted Fraud Probabilities on New Data")
plt.show()
plt.clf()
xg_test_probs_df["fraud_yes"].hist()
plt.title("Distribution Predicted Fraud Probabilities on Test Data")
plt.show()
plt.clf()
The distributions appear similar. We are interested in predicted probabilities of 50% or greater. The XGBoost classifier appears to be marginally stronger at predicting reported fraud of yes on the original test data at probabilities of 70% or higher. We’ll filter our data to observe only predicted fraud probabilities of 70% or greater.
xg_probs_ovr_sventyPct=xg_probs_df.copy()
xg_probs_ovr_sventyPct=xg_probs_ovr_sventyPct[xg_probs_ovr_sventyPct.fraud_yes >= 0.7]
new_probs_PctOvrSvnty=len(xg_probs_ovr_sventyPct)/len(xg_probs_df)*100
xg_test_probs_ovr_svntyPct=xg_test_probs_df.copy()
xg_test_probs_ovr_svntyPct=xg_test_probs_ovr_svntyPct[xg_test_probs_ovr_svntyPct.fraud_yes >=0.7]
test_probs_PctOvrSvnty=len(xg_test_probs_ovr_svntyPct) / len(xg_test_probs_df)*100
## 12.30% of predictions in the new data had a predicted fraud probability of
## 70% or greater, compared to 18.14% in the test data.
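The 70% filter above amounts to applying a stricter decision threshold than the default 0.5 used by predict. A minimal sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities for class 1 (fraud_yes)
fraud_probs = np.array([0.91, 0.42, 0.73, 0.08, 0.65])

# predict() uses a 0.5 cutoff; a stricter 0.7 cutoff keeps only
# the highest-confidence fraud predictions
labels_default = (fraud_probs >= 0.5).astype(int)
labels_strict = (fraud_probs >= 0.7).astype(int)
pct_over_70 = (fraud_probs >= 0.7).mean() * 100

print(labels_default.tolist())  # [1, 0, 1, 0, 1]
print(labels_strict.tolist())   # [1, 0, 1, 0, 0]
print(pct_over_70)              # 40.0
```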
It appears the XGBoost classifier predicted a higher percentage of high-confidence fraud on the original test set than on the new data. Assessment of the XGBoost classifier would benefit from collecting additional data before a definitive judgment can be made.